This document describes the computational analysis of the Chan M, Motakis E, Tan SH et al. Prioritizing Post-Myocardial Infarction Heart Failure Candidates Using Plasma Proteomics and Single-Cell Transcriptomics. Submitted to Circulation, 2020 (under review).
In this study, we employed aptamer-based affinity-capture plasma proteomics to measure 1305 plasma proteins in a New Zealand cohort (CDCS) and a Singapore (IMMACULATE) cohort. We sought to identify circulating proteins associated with an HF incident after an initial MI by performing large-scale proteomic profiling of plasma obtained from patients 30 days after their index MI. The pipeline and methodology have been extensively discussed in the submitted manuscript. Here, we will show in detail the computational steps (R script, tables and plots) that generated the main findings. The reader is free to use our code to replicate the results and to explore other aspects of our data.
A brief description of the datasets used in this study can be found at https://github.com/ArisStefanosSn/HFproteomics.
The primary cohort was selected from the Coronary Disease Cohort Study (CDCS, ACTRN 12605000431628), which consisted of 2140 patients hospitalized in two tertiary hospitals in New Zealand (NZ), for an acute coronary syndrome from 2002 – 2009, and followed up for a median of 5.1 years (interquartile range 3.7-6.8 years, maximum 9.5 years). Clinical data, blood samples and echocardiographic measurements were obtained at ~ 30 days, 4 months and 12 months after hospital admission. Subsequent clinical events and mortality were obtained from the NZ National Health Information System. The New Zealand Multi-region Ethics Committee approved the study (CTY/02/02/018) and all participants gave written informed consent prior to study participation. For the current study, we selected nested cases of 181 post-MI patients who had readmission for HF (HF group) and another 250 post-MI patients who remained free of HF hospitalization, adverse LV remodelling or death from a cardiovascular cause during follow-up (Control group).
The external cohort was selected from the Improving Outcomes in Myocardial Infarction through Reversal of Cardiac Remodelling (IMMACULATE) registry, which consisted of 859 patients hospitalized for MI at 3 hospitals in Singapore from 2011 to 2014 and followed up for a median of 392 days (interquartile range 361-724 days, maximum 1006 days). Clinical data, blood samples and echocardiographic measurements were obtained within 24-72 hours of admission, 30 days, 6, 12 and 24 months after hospital admission. We selected a cohort of the first 200 consecutively enrolled patients with 30-day plasma samples available and therefore had the longest follow-up period at the time of study completion. Of these 200 patients, 19 were hospitalized for HF. The institutional review board and the ethics committee at Singapore’s National Healthcare Group Domain Specific Review Board (DSRB 2013 /00248 and 2013/00635) approved the study protocol and all patients gave written informed consent prior to participation.
The primary clinical endpoint was hospitalization for HF, defined as clinically diagnosed acute HF requiring hospitalization for >24 hours or treatment with diuretics if duration of stay was <24 hours. For each patient we considered a large set of clinical measurements in our subsequent analysis:
Age: the patient’s age at admission.
Gender: (F)emale or (M)ale.
Race: Various ethnicities.
History.of.diabetes: whether the patient has a history of diabetes, (Y)es or (N)o.
History.of.hypertension: whether the patient has a history of hypertension, (Y)es or (N)o.
History.of.hyperlipidemia: whether the patient has a history of hyperlipidemia, (Y)es or (N)o.
Somalogic_Group_Summary: the patient groups, post-MI Heart Failure (incident_HF), post-MI MI (incident_MI), post-MI event free (Control).
Somalogic_Group_Full: Divisins of the above groups into incident_HF patients that showed HF only (HF_First) and incident_HF patients that showed HF and MI together (HF_and_MI_together).
Diagnosis_Dx: whether the patient had STEMI or NSTEMI.
Smoker: whether the patient in currently a smoker (current), an ex-smoker (ex-smoker) or a non-smoker (non).
BMI: the patient’s Body Mass Index.
Hx_Renal: History of renal disease, (Y)es or (N)o.
Hx_MI: History of Myocardial Infarction, (Y)es or (N)o.
Hx_HF: History of Heart Failure, (Y)es or (N)o.
Hx_PTCA: History of Percutaneous Transluminal Coronary Angioplasty, (Y)es or (N)o.
VesselDisease: Existence of vessel disease, one of None, Single, Double and Triple.
ACE1 / ACEI_ARB: Angiotensin Converting Enzyme inhibitors medication, (Y)es or (N)o.
Month_EF: Left Ventricular Ejection Fraction measured at 4 months.
The same variables are considered in both the CDCS and the IMMACULATE cohorts.
Plasma samples (50 \(\mu\)l) collected 30 days after the index event were analysed using a Slow Off-rate Modified Aptamer (SOMAmer)–based capture array called “SOMAscan” (somaLogic, Inc, Boulder, CO, USA). We measured 1,305 human proteins (47% secreted proteins, 28% extracellular domains and 25% intracellular proteins).
We interrogated our published dataset to integrate single cell non-CM transcriptomic data (scRNA-seq) from mice undergoing coronary ligation (MI) versus sham surgery and single nuclei CM transcriptomic data (snRNA-seq) from mice undergoing transverse aortic constriction (TAC) versus sham surgery. For non-CMs, cells were randomly isolated by flow-assisted cell sorting from digested mouse left ventricles 7 days post-MI/sham surgery (N=3 per group). For CMs, nuclei were isolated from digested mouse left ventricles 8 weeks post-TAC/sham surgery (N=5 per group). The cDNA libraries were prepared by SMART-Seq2 protocol and sequenced using HiSeq 2000 (Illumina). Data were further cross-referenced with publicly-available datasets containing single CM transcriptomes from a mouse TAC model and human DCM patients.
We analysed 1,305 plasma proteins measured for 750 patients in the CDCS cohort: 250 Control patients, 124 post-MI patients who had readmission for HF (HF group), 57 post-MI patients who had readmission for HF and MI (HF group), 250 post-MI patients who had readmission for MI (MI group) and 69 with cardiac remodelling asfter the initial MI (remodelled group). In the IMMACULATE cohort we had 330 patients: 107 Healthy Control, 190 post-MI event-free (Control) patients and 33 post-MI HF patients (HF group).
Among the 1,305 proteins are the two forms of proBNP, BNP and NT-proBNP. They are practically the same protein resulting from the cleavage of proBNP into its active and inactive form respectively. For simplicity, in this analysis we will consider and refer to them as two proteins.
# in R run this to set your working directory
# to the folder where HFproteomics is located
# setwd("~/Desktop/HFproteomics")
# source the necessary R packages and functions
source("ProteomicsData/source.R")
################
# CDCS dataset #
################
desi_CDCS <- read.table("ProteomicsData/Somalogic_Design_CDCS.txt", sep = "\t")
exprs_CDCS <- read.table("ProteomicsData/Somalogic_Expression_CDCS.txt", sep = "\t", header = TRUE)
Annot_CDCS <- read.table("ProteomicsData/Somalogic_Annotation_CDCS.txt", sep = "\t", header = TRUE)
# Number of proteins x samples
dim(exprs_CDCS)
## [1] 1305 750
# Number of HF and Control patients
table(desi_CDCS$Somalogic_Group_Full)
##
## Control HF_and_MI_together HF_First MI_First
## 250 57 124 250
## Remodelled
## 69
######################
# IMMACULATE dataset #
######################
desi_IMMACULATE <- read.table("ProteomicsData/Somalogic_Design_IMMACULATE.txt", sep = "\t")
desi_IMMACULATE_ACS <- data.frame(as.matrix(desi_IMMACULATE[desi_IMMACULATE$Somalogic_Group_Full == "ACS", ]))
exprs_IMMACULATE <- read.table("ProteomicsData/Somalogic_Expression_IMMACULATE.txt", sep = "\t", header = TRUE)
Annot_IMMACULATE <- read.table("ProteomicsData/Somalogic_Annotation_IMMACULATE.txt", sep = "\t", header = TRUE)
# Number of proteins x samples
dim(exprs_IMMACULATE)
## [1] 1305 330
# Number of Healthy, HF and Control patients
table(desi_IMMACULATE$HF)
##
## --- N Y
## 107 190 33
Below, we show the frequencies of certain important clinical variables with respect to the disease state at CDCS. Fisher exact test did not reveal any significant differences.
# disease state by ethnic group at CDCS
table(desi_CDCS$Race, desi_CDCS$Somalogic_Group_Full)
##
## Control HF_and_MI_together HF_First MI_First Remodelled
## African 1 0 0 0 0
## Asian 4 1 3 2 0
## Chinese 1 0 0 0 0
## European 7 2 4 7 5
## Fijian 1 0 0 1 0
## Indian 0 0 0 3 1
## Not_stated 31 3 3 8 7
## NZ_European 137 40 74 147 41
## NZ_Maori 4 3 3 12 1
## Other_European 61 6 34 67 13
## Other_Pacific_Island 1 0 0 0 0
## Pacific 2 2 3 3 1
# disease state by gender at CDCS
table(desi_CDCS$Gender, desi_CDCS$Somalogic_Group_Full)
##
## Control HF_and_MI_together HF_First MI_First Remodelled
## F 77 21 38 77 18
## M 173 36 86 173 51
# disease state by history of diabetes at CDCS
table(desi_CDCS$History.of.diabetes, desi_CDCS$Somalogic_Group_Full)
##
## Control HF_and_MI_together HF_First MI_First Remodelled
## N 214 39 83 202 62
## Y 36 18 41 48 7
# disease state by history of hypertension at CDCS (--- means NA)
table(desi_CDCS$History.of.hypertension, desi_CDCS$Somalogic_Group_Full)
##
## Control HF_and_MI_together HF_First MI_First Remodelled
## --- 2 1 4 1 0
## N 147 17 42 90 41
## Y 101 39 78 159 28
# disease state by Diagnosis at CDCS
table(desi_CDCS$Diagnosis_Dx, desi_CDCS$Somalogic_Group_Full)
##
## Control HF_and_MI_together HF_First MI_First Remodelled
## NSTEMI 166 45 92 210 47
## STEMI 84 12 32 40 22
In the same way we generated the frequency tables of the IMMACULATE cohort that contains only males from 3 ethnic groups. The data of the IMMACULATE cohort come in 2 batches, A and B, with a limited number of HF cases.
# disease state by ethnic group at IMMACULATE
table(desi_IMMACULATE_ACS$Race,desi_IMMACULATE_ACS$HF)
##
## N Y
## Chinese 116 19
## Indian 39 5
## Malay 35 9
# disease state by gender at IMMACULATE
table(desi_IMMACULATE_ACS$Gender,desi_IMMACULATE_ACS$HF)
##
## N Y
## M 190 33
# disease state by history of diabetes at IMMACULATE
table(desi_IMMACULATE_ACS$History.of.diabetes,desi_IMMACULATE_ACS$HF)
##
## N Y
## N 158 23
## Y 32 10
# disease state by history of hypertension at IMMACULATE
table(desi_IMMACULATE_ACS$History.of.hypertension,desi_IMMACULATE_ACS$HF)
##
## N Y
## N 112 19
## Y 78 14
# disease state by Diagnosis at IMMACULATE
table(desi_IMMACULATE_ACS$Diagnosis_Dx,desi_IMMACULATE_ACS$HF)
##
## N Y
## NSTEMI 80 8
## STEMI 110 25
# disease state by Batch at IMMACULATE
table(desi_IMMACULATE_ACS$Batch,desi_IMMACULATE_ACS$HF)
##
## N Y
## A 173 27
## B 17 6
Quality control was done on the raw Somascan data. The data was first normalized to remove within-run hybridization variation followed by median normalization across all samples and finally calibrated to eliminate inter-plate and inter-run differences. The acceptance criteria for normalization were 0.4 to 2.5 and calibration scale factor for the SOMAmer within ±0.4 of the median.
The expression profiles of the samples that passed the quality control were log2-transformed and adjusted for a set of confounding factors. For each protein, we used linear modelling to assess any independent effects of the following confounders: sex, age, BMI, smoking status, clinical history (diabetes, hypertension, hyperlipidemia, vessel and renal disease) and medication status (beta-blockers and ACE) on protein abundance. The variables with a false discovery rate <5% in more than 5% of the proteins were fitted in a multiple regression linear model, to minimize potential confounding effects. Due to the small sample size of the African, Chinese, Fijian, Indian and other pacific ethnicities in CDCS, we grouped the respective patients together.
# adjust the CDCS data
rr <- as.character(desi_CDCS$Race)
rr [which(rr == "African" |
rr == "Chinese" |
rr == "Fijian" |
rr == "Other_Pacific_Island" |
rr == "Indian")] <- "OtherRace"
desi_CDCS$Race <- factor(rr)
adj_CDCS <- adjustData_CDCS(Expr = exprs_CDCS, Design = desi_CDCS)
# save the adjusted data
# write.table(adj_CDCS, "ProteomicsData/Adjusted_Expression_CDCS.txt", sep = "\t")
The IMMACULATE cohort consisted of three enthinicites, i.e. Chinese (majority), Malay and Indian. No grouping was performed prior the data adjustment. Here, the adjustment model also accounted for the batch effects:
# adjust the IMMACULATE data
adj_IMMACULATE <- adjustData_IMMACULATE(Expr = exprs_IMMACULATE, Design = desi_IMMACULATE)
# save the adjusted data
# write.table(adj_IMMACULATE, "ProteomicsData/Adjusted_Expression_IMMACULATE.txt", sep = "\t")
The adjusted and log2-transformed values are stored and the samples of interest are extracted for further analysis:
# the adjusted CDCS data
data_CDCS <- CDCS_summary(folder = "ProteomicsData/")
data_CDCS$Expr[1:5, 1:5]
## AK0001 AK0004 AK0019 AK0021 AK0025
## SL019100:STUB1 3.030702 2.717001 3.035798 2.718204 2.834960
## SL007136:CEBPB 3.039809 3.019352 3.164512 3.179514 3.042512
## SL001731:ENO2 4.554199 4.398007 4.370560 4.409599 4.489801
## SL019096:PIAS4 2.871452 2.783088 2.949843 2.971706 2.755966
## SL005173:IL10RA 3.246265 3.229963 3.215740 3.253146 3.082142
dim(data_CDCS$Expr)
## [1] 1305 431
data_CDCS$Design[1:5, ]
## Subject_ID Age Gender Race History.of.diabetes
## AK0001 AK0001 60 F NZ_European N
## AK0004 AK0004 52 F NZ_European N
## AK0019 AK0019 76 M Asian N
## AK0021 AK0021 74 M Other_European N
## AK0025 AK0025 77 F Other_European N
## History.of.hypertension History.of.hyperlipidemia Somalogic_Group_Full
## AK0001 N N Control
## AK0004 N N Control
## AK0019 --- Y HF_First
## AK0021 Y N Control
## AK0025 Y N HF_First
## Somalogic_Group_Summary Diagnosis_Dx TimePoint SiteId Smoker
## AK0001 Control NSTEMI T2 CDCS Ex-smoker
## AK0004 Control NSTEMI T2 CDCS Non
## AK0019 incident_HF NSTEMI T2 CDCS Non
## AK0021 Control NSTEMI T2 CDCS Ex-smoker
## AK0025 incident_HF NSTEMI T2 CDCS Non
## BMI Hx_Renal Hx_MI Hx_HF Hx_PTCA PriStokeTIA VesselDisease
## AK0001 24.63547508 N N N N N None
## AK0004 23.5947692 N N N N N None
## AK0019 31.84713376 N Y Y N N ---
## AK0021 25.5648038 N Y N N N Triple
## AK0025 22.76795005 N N --- N N Double
## Thromb_Rplase Thromb_SK Thromb_TPA Beta_blocker_Discharge ACE1 ACEI_ARB
## AK0001 N N N N N ---
## AK0004 N N N Y N ---
## AK0019 N N N N Y ---
## AK0021 N N N Y Y ---
## AK0025 N N N Y N ---
## change.EDV change.ESV change.EF hsTnT_T2 NTproBNP_T2 hsTnI_T2
## AK0001 31.27413127 4.790419162 11.94029851 --- 38.26 3.6
## AK0004 -20.48076923 -12.36842105 -5.238095238 --- 23.42 2.2
## AK0019 --- --- 54.65116279 --- --- ---
## AK0021 14.50777202 7.418397626 4.795737123 --- 57.3 ---
## AK0025 25.80645161 43.25581395 -8.617886179 --- 380.67 502.7
## Grace.Score FUDays Death_Cause_Any Death_Cause_Cardiovascular Death_Day
## AK0001 --- 3364 N N ---
## AK0004 --- 3353 N N ---
## AK0019 --- 947 Y Y ---
## AK0021 --- 1530 Y N ---
## AK0025 --- 724 Y Y ---
## HF First_HF MI First_MI HStroke First_HStroke IStroke First_IStroke
## AK0001 N 3364 --- --- N 3364 N 3364
## AK0004 N 3353 --- --- N 3353 N 3353
## AK0019 Y 202 --- --- N 947 N 947
## AK0021 N 1530 --- --- N 1530 Y 614
## AK0025 Y 162 --- --- N 724 N 724
## Spironolactone STEMI First_Stemi NSTEMI First_NStemi UNSPECMI
## AK0001 N N 3364 N 3364 N
## AK0004 N N 3353 N 3353 N
## AK0019 N N 947 N 947 N
## AK0021 N N 1530 N 1530 N
## AK0025 N N 724 N 724 N
## First_UnSpecMI Baseline_EF Month_EF
## AK0001 3364 67 75
## AK0004 3353 63 59.7
## AK0019 947 17.2 26.6
## AK0021 1530 56.3 59
## AK0025 724 61.5 56.2
# the adjusted IMMACULATE data
data_IMMACULATE <- IMMACULATE_summary(folder = "ProteomicsData/")
data_IMMACULATE$Expr[1:5, 1:5]
## P014 P053 P054 P058 P060
## SL019100:STUB1 2.451027 2.688015 3.036004 2.726026 2.629335
## SL007136:CEBPB 2.941218 3.277664 2.959143 3.051050 3.080631
## SL001731:ENO2 3.115845 4.241446 4.237186 4.268835 4.132408
## SL019096:PIAS4 2.360218 2.891926 2.770493 2.979254 2.993664
## SL005173:IL10RA 3.132629 3.295121 3.168715 3.232676 3.310879
dim(data_IMMACULATE$Expr)
## [1] 1305 223
data_IMMACULATE$Design[1:5, ]
## Subject_ID Age Gender Race History.of.diabetes History.of.hypertension
## P014 P014 34 M Chinese N N
## P053 P053 54 M Chinese N Y
## P054 P054 46 M Chinese Y Y
## P058 P058 62 M Chinese Y Y
## P060 P060 46 M Chinese N N
## History.of.hyperlipidemia Somalogic_Group_Full Somalogic_Group_Summary
## P014 Y ACS Reverse
## P053 N ACS Adverse
## P054 Y ACS Reverse
## P058 N ACS Reverse
## P060 Y ACS Adverse
## Diagnosis_Dx TimePoint SiteId Smoker BMI Hx_Renal Hx_MI Hx_HF
## P014 STEMI T2 NUH Current 26.13920442 N N N
## P053 STEMI T2 NUH Current 26.62070286 N N N
## P054 STEMI T2 NUH Current 24.4646016 N N N
## P058 STEMI T2 NUH Non 24.5389649 N N N
## P060 STEMI T2 NUH Non 24.8015873 N N N
## Hx_PTCA PriStokeTIA VesselDisease Thromb_Rplase Thromb_SK Thromb_TPA
## P014 N --- Single --- --- ---
## P053 N --- Double --- --- ---
## P054 N --- Single --- --- ---
## P058 N --- Single --- --- ---
## P060 N --- Single --- --- ---
## Beta_blocker_Discharge ACE1 ACEI_ARB change.EDV change.ESV change.EF
## P014 Y --- Y --- --- ---
## P053 Y --- Y --- --- ---
## P054 Y --- --- --- --- ---
## P058 Y --- Y --- --- ---
## P060 Y --- Y --- --- ---
## hsTnT_T2 NTproBNP_T2 hsTnI_T2 Grace.Score FUDays Death_Cause_Any
## P014 --- --- --- --- --- ---
## P053 --- --- --- --- --- ---
## P054 --- --- --- --- --- ---
## P058 --- --- --- --- --- ---
## P060 --- --- --- --- --- ---
## Death_Cause_Cardiovascular Death_Day HF First_HF MI First_MI HStroke
## P014 --- --- Y --- --- --- ---
## P053 --- --- N --- --- --- ---
## P054 --- --- Y --- --- --- ---
## P058 --- --- N --- --- --- ---
## P060 --- --- N --- --- --- ---
## First_HStroke IStroke First_IStroke Spironolactone STEMI First_Stemi
## P014 --- --- --- --- --- ---
## P053 --- --- --- --- --- ---
## P054 --- --- --- --- --- ---
## P058 --- --- --- --- --- ---
## P060 --- --- --- --- --- ---
## NSTEMI First_NStemi UNSPECMI First_UnSpecMI Batch Baseline_EF Month_EF
## P014 --- --- --- --- B 47.2 47.1
## P053 --- --- --- --- B 42 52.9
## P054 --- --- --- --- B 36.4 48.1
## P058 --- --- --- --- B 43.1 44.1
## P060 --- --- --- --- B 47.6 45.1
We use t-Distributed Stochastic neighbor Embedding (tSNE) to contruct a low dimensional embedding of the high dimensional IMMACULATE data and visualise the successfull batch effect minimization. The tSNE analysis used all proteins that passed the quality control.
# tSNE of IMMACULATE (with Batch information)
TSNEplot(Expr = data_IMMACULATE$Expr,
Design = data_IMMACULATE$Design,
Annotation = data_IMMACULATE$Annotation,
cohort = "IMMACULATE",
diseaseColumn = 53)
This section outlines the analytical steps we followed for the analysis of the Somalogic data. It includes the individual pipelines for each cohort and their integration with the Left Ventricular Ejection Fraction (LVEF) measurements.
Before proceeding to the classical differential expression analysis that identifies proteins whose expression levels differ across HF and Control, we run Levene’s test to identify potential proteins whose variances differ significantly across those states. Levene’s test does not make any assumptions oon the data distribution.
#############################
# test for variance in CDCS #
#############################
het_CDCS <- LeveneTest(Expr = data_CDCS$Expr,
Design = data_CDCS$Design,
Annotation = data_CDCS$Annotation,
cohort = "CDCS",
diseaseColumn = 9)
het_CDCS[1:5,]
## Protein LeveneP adj.P.Val
## 130 SL001777:CST3 1.44943373921585e-09 1.63496125783548e-06
## 58 SL008631:DSC2 5.71533636980021e-08 3.18071085537748e-05
## 667 SL000586:F2 8.45933738132309e-08 3.18071085537748e-05
## 53 SL011888:SMOC1 2.71041139465109e-07 7.64336013291607e-05
## 31 SL004557:CD59 3.89726779371348e-07 8.79223614261761e-05
# store the results
# write.table(het_CDCS, "ProteomicsData/VarTest_CDCS.txt", sep = "\t")
###################################
# test for variance in IMMACULATE #
###################################
het_IMMACULATE <- LeveneTest(Expr = data_IMMACULATE$Expr,
Design = data_IMMACULATE$Design,
Annotation = data_IMMACULATE$Annotation,
cohort = "IMMACULATE",
diseaseColumn = 38)
het_IMMACULATE[1:5,]
## Protein LeveneP adj.P.Val
## 315 SL005205:NCR3 5.11693747145376e-05 0.055584265274954
## 817 SL000087:IL6 9.405121027911e-05 0.055584265274954
## 731 SL011535:RBM39 0.000375594725688858 0.14798432192141
## 697 SL000586:F2 0.000594794949622844 0.17576190761355
## 95 SL000053:PLAT 0.00157844533766554 0.23648242138832
# store the results
# write.table(het_IMMACULATE, "ProteomicsData/VarTest_IMMACULATE.txt", sep = "\t")
In CDCS we identified 50 proteins with significant adjusted P-values at FDR = 1% while in the IMMACULATE there were no such significant differences, potentially due to the small sample size of HF group.
# significant levene test in CDCS
table(as.numeric(as.character(het_CDCS$adj.P.Val)) <= 0.01)
##
## FALSE TRUE
## 1078 50
# significant levene test in IMMACULATE
table(as.numeric(as.character(het_IMMACULATE$adj.P.Val)) <= 0.01)
##
## FALSE
## 1182
The differential expression analysis was performed with the limma model as follows:
# protein differential expression in CDCS
DEs_CDCS <- DE_forHF(Expr = data_CDCS$Expr,
Design = data_CDCS$Design,
Annotation = data_CDCS$Annotation,
Cohort = "CDCS")
DEs_CDCS[1:5, ]
## SomaId TargetFullName Target
## SL002785:NPPB SL002785 N-terminal_pro-BNP N-terminal_pro-BNP
## SL001777:CST3 SL001777 Cystatin-C Cystatin_C
## SL006119:TFF3 SL006119 Trefoil_factor_3 TFF3
## SL009324:FSTL3 SL009324 Follistatin-related_protein_3 FSTL3
## SL008099:CAPG SL008099 Macrophage-capping_protein CAPG
## UniProt EntrezGeneID EntrezGeneSymbol Flag logFC AveExpr
## SL002785:NPPB P16860 4879 NPPB PASS 0.8391699 3.571936
## SL001777:CST3 P01034 1471 CST3 PASS 0.2552607 2.947754
## SL006119:TFF3 Q07654 7033 TFF3 PASS 0.3352670 3.571181
## SL009324:FSTL3 O95633 10272 FSTL3 PASS 0.2234513 3.348035
## SL008099:CAPG P40121 822 CAPG PASS 0.3413176 2.387919
## t P.Value adj.P.Val Comparison
## SL002785:NPPB 8.545723 2.207961e-16 2.881389e-13 HF-Control
## SL001777:CST3 6.939563 1.443106e-11 9.416268e-09 HF-Control
## SL006119:TFF3 6.757341 4.552645e-11 1.980400e-08 HF-Control
## SL009324:FSTL3 6.190932 1.390123e-09 3.754770e-07 HF-Control
## SL008099:CAPG 6.171111 1.560127e-09 3.754770e-07 HF-Control
# store the results
# write.table(DEs_CDCS, "ProteomicsData/Differential_Expression_CDCS.txt", sep = "\t")
We considered as differentially expressed the proteins with FDR < 5% (column adj.P.Val) that passed the quality control. The differentially expressed proteins are shown in a volcano plot:
# volcano plot of differential expression estimates in CDCS
VolcanoPlot(de = DEs_CDCS, cohort = "CDCS")
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
We found that 38 out of the 212 differentially expressed proteins have different variances across the HF and Control groups which may also contribute to the differences in the respective expression levels.
# differentially expressed proteins in CDCS
DEproteins_in_cdcs <- as.character(rownames(DEs_CDCS[DEs_CDCS$adj.P.Val <= 0.05 &
DEs_CDCS$Flag == "PASS", ]))
length(DEproteins_in_cdcs)
## [1] 212
# proteins with differential variability in CDCS
DVproteins_in_cdcs <- as.character(het_CDCS[as.numeric(as.character(het_CDCS$adj.P.Val)) <= 0.01, 1])
length(DVproteins_in_cdcs)
## [1] 50
# differentially expressed proteins with different variances across the patient groups
length(intersect(DEproteins_in_cdcs, DVproteins_in_cdcs))
## [1] 38
The IMMACULATE cohort was underpowered since it included only 33 HF cases which led to a low detection power of differentially expressed proteins (only 12 proteins passed the above criteria). Below we present the best 15 hits at FDR = 10% which include the famous NPPB (proBNP) and NPPA (atrial natriuretic factor) proteins. The most famous marker NT-proBNP is the next (16th) ranked with FDR = 0.106.
# protein differential expression in IMMACULATE
DEs_IMMACULATE <- DE_forHF(Expr = data_IMMACULATE$Expr,
Design = data_IMMACULATE$Design,
Annotation = data_IMMACULATE$Annotation,
Cohort = "IMMACULATE")
DEs_IMMACULATE[1:15, ]
## SomaId
## SL007033:LTBP4 SL007033
## SL007696:CD93 SL007696
## SL003320:FIGF SL003320
## SL005115:SPON1 SL005115
## SL007680:ROBO2 SL007680
## SL008416:MRC2 SL008416
## SL001996:ANGPT2 SL001996
## SL009324:FSTL3 SL009324
## SL004258:ADIPOQ SL004258
## SL000053:PLAT SL000053
## SL000087:IL6 SL000087
## SL000306:NPPB SL000306
## SL000124:MMP2 SL000124
## SL002505:NPPA SL002505
## SL000076:CDKN1B SL000076
## TargetFullName
## SL007033:LTBP4 Latent-transforming_growth_factor_beta-binding_protein_4
## SL007696:CD93 Complement_component_C1q_receptor
## SL003320:FIGF Vascular_endothelial_growth_factor_D
## SL005115:SPON1 Spondin-1
## SL007680:ROBO2 Roundabout_homolog_2
## SL008416:MRC2 C-type_mannose_receptor_2
## SL001996:ANGPT2 Angiopoietin-2
## SL009324:FSTL3 Follistatin-related_protein_3
## SL004258:ADIPOQ Adiponectin
## SL000053:PLAT Tissue-type_plasminogen_activator
## SL000087:IL6 Interleukin-6
## SL000306:NPPB Brain_natriuretic_peptide_32
## SL000124:MMP2 72_kDa_type_IV_collagenase
## SL002505:NPPA Atrial_natriuretic_factor
## SL000076:CDKN1B Cyclin-dependent_kinase_inhibitor_1B
## Target UniProt EntrezGeneID EntrezGeneSymbol FlagA
## SL007033:LTBP4 LTBP4 Q8N2S1 8425 LTBP4 PASS
## SL007696:CD93 C1QR1 Q9NPY3 22918 CD93 PASS
## SL003320:FIGF VEGF-D O43915 2277 FIGF PASS
## SL005115:SPON1 Spondin-1 Q9HCB6 10418 SPON1 PASS
## SL007680:ROBO2 ROBO2 Q9HCK4 6092 ROBO2 PASS
## SL008416:MRC2 MRC2 Q9UBG0 9902 MRC2 PASS
## SL001996:ANGPT2 Angiopoietin-2 O15123 285 ANGPT2 PASS
## SL009324:FSTL3 FSTL3 O95633 10272 FSTL3 PASS
## SL004258:ADIPOQ Adiponectin Q15848 9370 ADIPOQ PASS
## SL000053:PLAT tPA P00750 5327 PLAT PASS
## SL000087:IL6 IL-6 P05231 3569 IL6 PASS
## SL000306:NPPB BNP-32 P16860 4879 NPPB PASS
## SL000124:MMP2 MMP-2 P08253 4313 MMP2 PASS
## SL002505:NPPA ANP P01160 4878 NPPA PASS
## SL000076:CDKN1B p27Kip1 P46527 1027 CDKN1B PASS
## Protein logFC AveExpr t P.Value
## SL007033:LTBP4 SL007033:LTBP4 0.2451720 3.007896 5.977009 8.858637e-09
## SL007696:CD93 SL007696:CD93 0.2532191 3.988323 5.072210 8.235112e-07
## SL003320:FIGF SL003320:FIGF 0.3115731 2.524268 5.021685 1.044086e-06
## SL005115:SPON1 SL005115:SPON1 0.2287463 3.259998 4.980009 1.268187e-06
## SL007680:ROBO2 SL007680:ROBO2 0.1932418 3.504850 4.812317 2.739857e-06
## SL008416:MRC2 SL008416:MRC2 0.1983016 3.551319 4.437876 1.424627e-05
## SL001996:ANGPT2 SL001996:ANGPT2 0.2977880 2.116766 4.393125 1.723218e-05
## SL009324:FSTL3 SL009324:FSTL3 0.1766642 3.485938 4.054383 6.935081e-05
## SL004258:ADIPOQ SL004258:ADIPOQ 0.3155099 3.274410 3.971442 9.625537e-05
## SL000053:PLAT SL000053:PLAT 0.3399825 2.397425 3.834363 1.635838e-04
## SL000087:IL6 SL000087:IL6 0.3450569 2.361511 3.693566 2.778209e-04
## SL000306:NPPB SL000306:NPPB 0.5084806 2.602311 3.601073 3.901509e-04
## SL000124:MMP2 SL000124:MMP2 0.1421153 3.852925 3.526748 5.100632e-04
## SL002505:NPPA SL002505:NPPA 0.4497471 2.833070 3.449183 6.715464e-04
## SL000076:CDKN1B SL000076:CDKN1B -0.5096623 2.879893 -3.311893 1.079999e-03
## adj.P.Val Comparison
## SL007033:LTBP4 1.156052e-05 HF-Control
## SL007696:CD93 4.137462e-04 HF-Control
## SL003320:FIGF 4.137462e-04 HF-Control
## SL005115:SPON1 4.137462e-04 HF-Control
## SL007680:ROBO2 7.151027e-04 HF-Control
## SL008416:MRC2 3.098563e-03 HF-Control
## SL001996:ANGPT2 3.212570e-03 HF-Control
## SL009324:FSTL3 1.131285e-02 HF-Control
## SL004258:ADIPOQ 1.395703e-02 HF-Control
## SL000053:PLAT 2.134768e-02 HF-Control
## SL000087:IL6 3.295966e-02 HF-Control
## SL000306:NPPB 4.242891e-02 HF-Control
## SL000124:MMP2 5.120249e-02 HF-Control
## SL002505:NPPA 6.259772e-02 HF-Control
## SL000076:CDKN1B 9.395991e-02 HF-Control
# store the results
# write.table(DEs_IMMACULATE, "ProteomicsData/Differential_Expression_IMMACULATE.txt", sep = "\t")
# number of significant proteins in IMMACULATE (FDR = 5%)
DEproteins_in_immaculate5 <- as.character(rownames(DEs_IMMACULATE[DEs_IMMACULATE$adj.P.Val <= 0.05 &
DEs_IMMACULATE$FlagA == "PASS", ]))
length(DEproteins_in_immaculate5)
## [1] 12
# number of significant proteins in IMMACULATE (FDR = 10%)
DEproteins_in_immaculate10 <- as.character(rownames(DEs_IMMACULATE[DEs_IMMACULATE$adj.P.Val <= 0.1 &
DEs_IMMACULATE$FlagA == "PASS", ]))
length(DEproteins_in_immaculate10)
## [1] 15
# the rank of NT-proBNP
grep("N-terminal_pro-BNP", DEs_IMMACULATE$TargetFullName)
## [1] 16
DEs_IMMACULATE[grep("N-terminal_pro-BNP", DEs_IMMACULATE$TargetFullName), ]
## SomaId TargetFullName Target UniProt
## SL002785:NPPB SL002785 N-terminal_pro-BNP N-terminal_pro-BNP P16860
## EntrezGeneID EntrezGeneSymbol FlagA Protein logFC
## SL002785:NPPB 4879 NPPB PASS SL002785:NPPB 0.7968104
## AveExpr t P.Value adj.P.Val Comparison
## SL002785:NPPB 4.257362 3.255048 0.001309033 0.106768 HF-Control
In terms of differential expression at FDR < 5%, we find 8 reproducible proteins across the two cohorts:
# common proteins by FDR
common_by_fdr <- intersect(DEproteins_in_immaculate5, DEproteins_in_cdcs)
# common by FDR and directionality
common_CDCS <- DEs_CDCS[match(common_by_fdr, rownames(DEs_CDCS)), ]
common_IMMACULATE <- DEs_IMMACULATE[match(common_by_fdr, rownames(DEs_IMMACULATE)), ]
ww <- sign(common_CDCS[,8]) == sign(common_IMMACULATE[,9])
table(ww)
## ww
## FALSE TRUE
## 1 8
common_CDCS <- common_CDCS[ww, ]
common_IMMACULATE <- common_IMMACULATE[ww, ]
# common proteins
as.character(rownames(common_CDCS))
## [1] "SL007033:LTBP4" "SL007696:CD93" "SL003320:FIGF" "SL005115:SPON1"
## [5] "SL001996:ANGPT2" "SL009324:FSTL3" "SL004258:ADIPOQ" "SL000306:NPPB"
We used the meta-analysis metaRNASeq algorithm to combine the information of the CDCS and IMMACULATE experiments and provide an indication as to which proteins tend to be significant in both cohorts (and thus cross-cohort validated). We used strict criteria to identify them, i.e. they should be (1) differentially expressed in the CDCS (FDR < 5% that pass the quality control), (2) show evidence of differential expression in the IMMACULATE (P-value < 5% that pass the quality control), (3) differentially expressed in the meta-analysis (meta-analysis FDR < 5%) and (4) the logFCs in the CDCS and the IMMACULATE have the same direction (both up-regulated or both down-regulated in HF).
# meta-analysis
meta <- metaCohorts(DE_CDCS = DEs_CDCS,
DE_IMMACULATE = DEs_IMMACULATE)
head(meta)
## isDE_byFDR5perc_CDCS isDE_byFDR5perc_IMM isDE_byP5perc_IMM
## SL002785:NPPB TRUE FALSE TRUE
## SL006119:TFF3 TRUE FALSE TRUE
## SL009324:FSTL3 TRUE TRUE TRUE
## SL006527:EFEMP1 TRUE FALSE TRUE
## SL003869:GDF15 TRUE FALSE TRUE
## SL003320:FIGF TRUE TRUE TRUE
## signIndex Fisher_meta_FDR Validated
## SL002785:NPPB 1 0.000000e+00 TRUE
## SL006119:TFF3 1 1.032246e-08 TRUE
## SL009324:FSTL3 1 1.099741e-09 TRUE
## SL006527:EFEMP1 1 4.413441e-08 TRUE
## SL003869:GDF15 1 2.655746e-07 TRUE
## SL003320:FIGF 1 2.503278e-10 TRUE
# store the results
# write.table(meta, "ProteomicsData/Meta_IMMACULATE.txt", sep = "\t")
# number of validated proteins (TRUE)
table(meta$Validated)
##
## FALSE TRUE
## 1069 36
This procedure found 36 cross-cohort validated proteins of which NPPB (NT-proBNP) was at the top of the list (meta-analysis FDR < 1e-16). Several other famous candidates also appeared such as TNNT2, NPPB (pro-BNP), TNNI3 etc.
To check the importance of the significant proteins derived from the IMMACULATE cohort, we examine the percentage of variance explained by them in a PCA plot and visualise how well they separate the patient groups of interest with tSNE. First we check the 15 proteins that were found to be differentially expressed at FDR = 10%. The results show a clear improvement in the separation of HF and Control groups compared to the above (all proteins) tSNE representation.
# differentially expressed proteins at FDR = 10%
proteins15 <- rownames( DEs_IMMACULATE[ DEs_IMMACULATE$adj.P.Val <= 0.1,])
# significant proteins at FDR = 10%
k15 <- redDim_IMMACULATE(Expr = data_IMMACULATE$Expr,
Design = data_IMMACULATE$Design,
Annotation = data_IMMACULATE$Annotation,
proteins = proteins15,
diseaseColumn = 38,
legend = "significant proteins at FDR = 10%")
## NULL
Next, we study the 36 proteins that have been validated in both cohorts by meta-analysis. Note that 5 out of the 15 proteins with FDR < 10% in the IMMACULATE cohort were not validated in the meta-analysis (they were not highly significant in the CDCS), 10 proteins were common by both methods and 26 proteins were found only in the meta-analysis (among which is the famous NT-proBNP marker). As before, the separation of HF and Control groups is clear.
# all validated proteins of meta-analysis
proteins36 <- rownames(meta)[meta$Validated]
# proteins in top 15 that were not cross-cohort validated
setdiff(proteins15, proteins36)
## [1] "SL007680:ROBO2" "SL008416:MRC2" "SL000053:PLAT" "SL000124:MMP2"
## [5] "SL000076:CDKN1B"
# common proteins in top 15 cross-cohort validation
intersect(proteins15, proteins36)
## [1] "SL007033:LTBP4" "SL007696:CD93" "SL003320:FIGF" "SL005115:SPON1"
## [5] "SL001996:ANGPT2" "SL009324:FSTL3" "SL004258:ADIPOQ" "SL000087:IL6"
## [9] "SL000306:NPPB" "SL002505:NPPA"
# cross-cohort validated proteins that were not among the top 15
setdiff(proteins36, proteins15)
## [1] "SL002785:NPPB" "SL006119:TFF3" "SL006527:EFEMP1"
## [4] "SL003869:GDF15" "SL000052:TNNT2" "SL007056:BMP10"
## [7] "SL006610:ADAMTS13" "SL007206:THBS2" "SL004646:LAYN"
## [10] "SL009400:CHRDL1" "SL011888:SMOC1" "SL004060:ECE1"
## [13] "SL003327:CFD" "SL008382:CST5" "SL003324:F10"
## [16] "SL003770:SFRP1" "SL005230:UNC5C" "SL005789:STC1"
## [19] "SL000592:TIMP2" "SL001761:TNNI3" "SL006924:AGRP"
## [22] "SL005209:NOTCH3" "SL010381:DKKL1" "SL002823:SELL"
## [25] "SL006230:LUM" "SL000462:IGFBP1"
# cross-cohort validated proteins
k36 <- redDim_IMMACULATE(Expr = data_IMMACULATE$Expr,
Design = data_IMMACULATE$Design,
Annotation = data_IMMACULATE$Annotation,
proteins = proteins36,
diseaseColumn = 38,
legend = "36 cross-cohort validated proteins")
## NULL
Finally, we study only the separation offered by the 26 cross-cohort validated proteins that are not detected by IMMACULATE’s FDR cutoff. Again, the separation of the two groups is visually clear indicating the importance of these candidates.
# proteins uniquely found in meta-analysis
proteins26 <- setdiff(proteins36, proteins15)
# cross-cohort validated proteins with FDR > 10%
k26 <- redDim_IMMACULATE(Expr = data_IMMACULATE$Expr,
Design = data_IMMACULATE$Design,
Annotation = data_IMMACULATE$Annotation,
proteins = proteins26,
diseaseColumn = 38,
legend = "26 cross-cohort validated proteins with FDR > 10%")
## NULL
To quantify the significance of the 36 proteins extracted from the meta-analysis, we developed a simple methodology that compares the HF and Control clusters generated from those proteins with the clusters formed by a random selection of any 36 proteins. The random selection is taken from the pool of the CDCS significant proteins in order to build a strict testing procedure that takes into account meaningful candidates. The algorithm we developed works as follows:
Step 1: Estimate the test statistic \(t = (M_{Control} - M_{HF}) / \sqrt{(Var_{Control} + Var_{HF})\) where \(M_{Control}\) is the 2D mean of the Control group samples estimated from their PCA coordinates, \(M_{HF}\) is the 2D mean of the HF group samples estimated from their PCA coordinates, \(Var_{Control}\) is the variance of the Control group estimated from the within cluster sum of squares using Euclidean distances and \(Var_{HF}\) is the variance of the HF group estimated from the within cluster sum of squares using Euclidean distances. This is similar to a t-statistic quantifying the mean difference of two samples with unequal variances and sample sizes.
Step 2: At the first iteration b = 1, select 36 random proteins from the pool of the 212 significant CDCS proteins, estimate a PCA and evaluate the resampling statistic \(t^* = (M^*_{Control} - M^*_{HF}) / \sqrt{(Var^*_{Control} + Var^*_{HF})\) where \(M^*_{Control}\), \(M^*_{HF}\), \(Var^*_{Control}\) and \(Var^*_{HF}\) are defined as above.
Step 3: Iterate for b = 2, …, B = 10,000 and compare the resampling statistics \(t^*\) against \(t\) to obtain a resampling p-value: \(P = #(t^* > t) / B\) that evaluates the significance of the group separation offered by the 36 identified proteins.
# run the algorithm 10,000 times
background.proteins <- DEproteins_in_cdcs
clStat <- clusterSignificance(pcadata = k36,
randomSet = 36,
proteins = background.proteins,
B = 10000)
# store the results of the t*-statistics
# write(clStat$testB, "ProteomicsData/tstar.txt", ncolumns = 1)
This procedure indicated that the 36 selected proteins can significantly separate the HF and Control groups with P < 1e-4.
# read the stored t-statistics for simplicity
clStat_all.t <- scan("ProteomicsData/tstar.txt")
# the t-statistic of the 36 set is the first element
clStat_t <- clStat_all.t[1]
clStat_t
## [1] 0.7527349
# the t*-statistics of the resampling are the rest of the elements
clStat_tstar <- clStat_all.t[2:length(clStat_all.t)]
# see the histogram of the 10,000 t*-statistics
hist(clStat_tstar, xlim = c(0, 1), xlab = "t-statistics",
main = "Cluster comparison statistic of 10,000 random protein sets",
sub = "The vertical line is the t-stat of the 36 proteins")
abline(v = clStat_t)
#p1<-ggplot(hey,aes(x=clStat_tstar)) + geom_histogram(colour="gray") +
# labs(y="Frequency",x="t-star") +
# geom_vline(xintercept=clStat_t,colour="red",linetype="dashed") +
# lims(x=c(0,0.8))
The above methodology confirmed that the 36 protein separation of the groups in the IMMACULATE is not random. We are also interested how well this validated set performs in predicting the HF and Control groups in both cohorts. To do this, we developed the following machine learning methodology using supervised Random Forests:
Step 1: Apply the supervised random forests algorithm on the 36 cross-cohort validated proteins in the IMMACULATE test cohort and predict the HF and Control groups in the CDCS cohort. Calculate the confusion matrix, \(k\), between the estimated and the known disease groups.
Step 2: Sort the CDCS and the IMMACULATE DE results by FDR, pick those with FDR < 5% in both sets and repeat Step 1 to find the new confusion matrix \(top_1\).
Step 3: Iterate with a small step \(\epsilon = 0.01\) to pick the proteins with FDR < 5% + \(\epsilon\) and find new confusion matrices \(top_2\), …, \(top_B\). Stop the process at the cutoff FDR = 0.5.
Step 4: Compare the estimation error of \(k\) against the estimation errors of the confusion matrices of the top IMMACULATE proteins.
# random forest application with 2,000 trees
rf<-rfGroupSeparation(Expr_CDCS = data_CDCS$Expr,
Expr_IMMACULATE = data_IMMACULATE$Expr,
Design_CDCS = data_CDCS$Design,
Design_IMMACULATE = data_IMMACULATE$Design,
DEs_CDCS = DEs_CDCS,
DEs_IMMACULATE = DEs_IMMACULATE,
proteins = proteins36,
ntree = 2000,
starting.fdr = 0.05,
ending.fdr = 0.5)
# store the results
# write.table(rf, "ProteomicsData/RF.txt", sep = "\t")
The results shows that our 36 proteins is the top predictive set exhibiting minimal estimation error in both individual groups and in total (mean). Of particular importance is that, as the figures show, these 36 proteins form a relatively large set whose prediction accuracy is directly comparable to the accuracies of much smaller protein sets in Control. The moderate to low error magnitudes could be attributed to the data noise, the unbalanced design, the ethnicity differences across the two cohorts and, importantly, to the fact that the Control and HF patients are essentially both disease groups (there is no classical healthy control).
# read the stored estimation errors for simplicity
rf <- read.table("ProteomicsData/RF.txt", sep = "\t", header = T)
rf
## N.of.proteins N Y
## 1 8 0.3675676 0.2622951
## 2 9 0.3691173 0.2565789
## 3 10 0.3676471 0.2368421
## 4 11 0.3671551 0.2301065
## 5 13 0.3701405 0.2297313
## 6 14 0.3713528 0.2407407
## 7 16 0.3713528 0.2407407
## 8 17 0.3728041 0.2288185
## 9 18 0.3733681 0.2083333
## 10 21 0.3762892 0.2111936
## 11 22 0.3775773 0.1976744
## 12 26 0.3788660 0.2093023
## 13 27 0.3791774 0.2023810
## 14 28 0.3810701 0.2001251
## 15 32 0.3801020 0.1794872
## 16 33 0.3826531 0.2051282
## 17 36 0.3704663 0.1555556
## 18 37 0.3762887 0.1860465
## 19 40 0.3792878 0.1911555
## 20 41 0.3794872 0.1951220
# estimate the mean errors of both groups
me <- apply(rf[, 2:3], 1, mean)
# plot the results
par(mfrow=c(2, 2))
plot(rf[rf[,1] != 36, 1:2],
pch = 20, cex = 1.5, ylim = range(rf[, 2]),
xlab = "Number of proteins",
ylab = "Estimation error (Control)",
sub = "The cross indicates the estimate of the 36 proteins")
lines(lowess(rf[rf[, 1] != 36, 1], rf[rf[, 1] != 36, 2]))
points(rf[rf[, 1] == 36, 1:2], col = 2, pch = 3)
plot(rf[rf[,1] != 36, c(1, 3)],
pch = 20, cex = 1.5, ylim = range(rf[, 3]),
xlab = "Number of proteins",
ylab = "Estimation error (HF)",
sub = "The cross indicates the estimate of the 36 proteins")
lines(lowess(rf[rf[, 1] != 36, 1], rf[rf[, 1] != 36, 3]))
points(rf[rf[, 1] == 36, c(1, 3)], col = 2, pch = 3)
plot(rf[rf[,1] != 36, 1], me[rf[,1] != 36],
pch = 20, cex = 1.5, ylim = range(me),
xlab = "Number of proteins",
ylab = "Estimation error (Mean)",
sub = "The cross indicates the estimate of the 36 proteins")
lines(lowess(rf[rf[, 1] != 36, 1], me[rf[, 1] != 36]))
points(rf[rf[, 1] == 36, 1], me[rf[, 1] == 36], col = 2, pch = 3)
Our methodology keeps an important amount of information that would have been lost if we used only the FDR for validating the candidates of both cohorts. This is clearly highlighted by looking at the differential expression statistics of the 36 proteins in the IMMACULATE (by default all of them are significant in CDCS). This set cannot be reproduced by a method that is based on FDR-sorteds proteins:
# differential expressin statistics of the 36 proteins in the IMMACULATE
DEs_IMMACULATE[match(proteins36, rownames(DEs_IMMACULATE)),]
## SomaId
## SL002785:NPPB SL002785
## SL006119:TFF3 SL006119
## SL009324:FSTL3 SL009324
## SL006527:EFEMP1 SL006527
## SL003869:GDF15 SL003869
## SL003320:FIGF SL003320
## SL000052:TNNT2 SL000052
## SL001996:ANGPT2 SL001996
## SL007056:BMP10 SL007056
## SL000306:NPPB SL000306
## SL006610:ADAMTS13 SL006610
## SL007206:THBS2 SL007206
## SL004646:LAYN SL004646
## SL004258:ADIPOQ SL004258
## SL005115:SPON1 SL005115
## SL009400:CHRDL1 SL009400
## SL011888:SMOC1 SL011888
## SL004060:ECE1 SL004060
## SL003327:CFD SL003327
## SL008382:CST5 SL008382
## SL003324:F10 SL003324
## SL003770:SFRP1 SL003770
## SL005230:UNC5C SL005230
## SL005789:STC1 SL005789
## SL000592:TIMP2 SL000592
## SL007696:CD93 SL007696
## SL001761:TNNI3 SL001761
## SL006924:AGRP SL006924
## SL005209:NOTCH3 SL005209
## SL007033:LTBP4 SL007033
## SL010381:DKKL1 SL010381
## SL002823:SELL SL002823
## SL006230:LUM SL006230
## SL000462:IGFBP1 SL000462
## SL002505:NPPA SL002505
## SL000087:IL6 SL000087
## TargetFullName
## SL002785:NPPB N-terminal_pro-BNP
## SL006119:TFF3 Trefoil_factor_3
## SL009324:FSTL3 Follistatin-related_protein_3
## SL006527:EFEMP1 EGF-containing_fibulin-like_extracellular_matrix_protein_1
## SL003869:GDF15 Growth/differentiation_factor_15
## SL003320:FIGF Vascular_endothelial_growth_factor_D
## SL000052:TNNT2 Troponin_T,_cardiac_muscle
## SL001996:ANGPT2 Angiopoietin-2
## SL007056:BMP10 Bone_morphogenetic_protein_10
## SL000306:NPPB Brain_natriuretic_peptide_32
## SL006610:ADAMTS13 A_disintegrin_and_metalloproteinase_with_thrombospondin_motifs_13
## SL007206:THBS2 Thrombospondin-2
## SL004646:LAYN Layilin
## SL004258:ADIPOQ Adiponectin
## SL005115:SPON1 Spondin-1
## SL009400:CHRDL1 Chordin-like_protein_1
## SL011888:SMOC1 SPARC-related_modular_calcium-binding_protein_1
## SL004060:ECE1 Endothelin-converting_enzyme_1
## SL003327:CFD Complement_factor_D
## SL008382:CST5 Cystatin-D
## SL003324:F10 Coagulation_factor_Xa
## SL003770:SFRP1 Secreted_frizzled-related_protein_1
## SL005230:UNC5C Netrin_receptor_UNC5C
## SL005789:STC1 Stanniocalcin-1
## SL000592:TIMP2 Metalloproteinase_inhibitor_2
## SL007696:CD93 Complement_component_C1q_receptor
## SL001761:TNNI3 Troponin_I,_cardiac_muscle
## SL006924:AGRP Agouti-related_protein
## SL005209:NOTCH3 Neurogenic_locus_notch_homolog_protein_3
## SL007033:LTBP4 Latent-transforming_growth_factor_beta-binding_protein_4
## SL010381:DKKL1 Dickkopf-like_protein_1
## SL002823:SELL L-Selectin
## SL006230:LUM Lumican
## SL000462:IGFBP1 Insulin-like_growth_factor-binding_protein_1
## SL002505:NPPA Atrial_natriuretic_factor
## SL000087:IL6 Interleukin-6
## Target UniProt EntrezGeneID
## SL002785:NPPB N-terminal_pro-BNP P16860 4879
## SL006119:TFF3 TFF3 Q07654 7033
## SL009324:FSTL3 FSTL3 O95633 10272
## SL006527:EFEMP1 FBLN3 Q12805 2202
## SL003869:GDF15 MIC-1 Q99988 9518
## SL003320:FIGF VEGF-D O43915 2277
## SL000052:TNNT2 Troponin_T P45379 7139
## SL001996:ANGPT2 Angiopoietin-2 O15123 285
## SL007056:BMP10 BMP10 O95393 27302
## SL000306:NPPB BNP-32 P16860 4879
## SL006610:ADAMTS13 ATS13 Q76LX8 11093
## SL007206:THBS2 TSP2 P35442 7058
## SL004646:LAYN Layilin Q6UX15 143903
## SL004258:ADIPOQ Adiponectin Q15848 9370
## SL005115:SPON1 Spondin-1 Q9HCB6 10418
## SL009400:CHRDL1 CRDL1 Q9BU40 91851
## SL011888:SMOC1 SMOC1 Q9H4F8 64093
## SL004060:ECE1 Endothelin-converting_enzyme_1 P42892 1889
## SL003327:CFD Factor_D P00746 1675
## SL008382:CST5 CYTD P28325 1473
## SL003324:F10 Coagulation_Factor_Xa P00742 2159
## SL003770:SFRP1 SARP-2 Q8N474 6422
## SL005230:UNC5C UNC5H3 O95185 8633
## SL005789:STC1 Stanniocalcin-1 P52823 6781
## SL000592:TIMP2 TIMP-2 P16035 7077
## SL007696:CD93 C1QR1 Q9NPY3 22918
## SL001761:TNNI3 Troponin_I P19429 7137
## SL006924:AGRP ART O00253 181
## SL005209:NOTCH3 Notch-3 Q9UM47 4854
## SL007033:LTBP4 LTBP4 Q8N2S1 8425
## SL010381:DKKL1 Soggy-1 Q9UK85 27120
## SL002823:SELL sL-Selectin P14151 6402
## SL006230:LUM Lumican P51884 4060
## SL000462:IGFBP1 IGFBP-1 P08833 3484
## SL002505:NPPA ANP P01160 4878
## SL000087:IL6 IL-6 P05231 3569
## EntrezGeneSymbol FlagA Protein logFC AveExpr
## SL002785:NPPB NPPB PASS SL002785:NPPB 0.79681036 4.257362
## SL006119:TFF3 TFF3 PASS SL006119:TFF3 0.12252184 3.828077
## SL009324:FSTL3 FSTL3 PASS SL009324:FSTL3 0.17666420 3.485938
## SL006527:EFEMP1 EFEMP1 PASS SL006527:EFEMP1 0.12668089 2.823353
## SL003869:GDF15 GDF15 PASS SL003869:GDF15 0.23869490 2.398806
## SL003320:FIGF FIGF PASS SL003320:FIGF 0.31157307 2.524268
## SL000052:TNNT2 TNNT2 PASS SL000052:TNNT2 0.26394286 3.343274
## SL001996:ANGPT2 ANGPT2 PASS SL001996:ANGPT2 0.29778799 2.116766
## SL007056:BMP10 BMP10 PASS SL007056:BMP10 0.14511313 3.053178
## SL000306:NPPB NPPB PASS SL000306:NPPB 0.50848065 2.602311
## SL006610:ADAMTS13 ADAMTS13 PASS SL006610:ADAMTS13 -0.13719979 3.755110
## SL007206:THBS2 THBS2 PASS SL007206:THBS2 0.23227681 3.701019
## SL004646:LAYN LAYN PASS SL004646:LAYN 0.18864544 2.935491
## SL004258:ADIPOQ ADIPOQ PASS SL004258:ADIPOQ 0.31550991 3.274410
## SL005115:SPON1 SPON1 PASS SL005115:SPON1 0.22874628 3.259998
## SL009400:CHRDL1 CHRDL1 PASS SL009400:CHRDL1 0.10773789 3.249004
## SL011888:SMOC1 SMOC1 PASS SL011888:SMOC1 0.16164263 3.401509
## SL004060:ECE1 ECE1 PASS SL004060:ECE1 -0.11222754 4.410959
## SL003327:CFD CFD PASS SL003327:CFD 0.13979577 2.900248
## SL008382:CST5 CST5 PASS SL008382:CST5 0.46872856 3.643594
## SL003324:F10 F10 PASS SL003324:F10 -0.12595674 3.790039
## SL003770:SFRP1 SFRP1 PASS SL003770:SFRP1 0.33547700 3.162030
## SL005230:UNC5C UNC5C PASS SL005230:UNC5C 0.10127373 3.400777
## SL005789:STC1 STC1 PASS SL005789:STC1 0.14287833 3.062556
## SL000592:TIMP2 TIMP2 PASS SL000592:TIMP2 0.12737344 2.491127
## SL007696:CD93 CD93 PASS SL007696:CD93 0.25321915 3.988323
## SL001761:TNNI3 TNNI3 PASS SL001761:TNNI3 0.43004405 2.892466
## SL006924:AGRP AGRP PASS SL006924:AGRP 0.19687481 3.404004
## SL005209:NOTCH3 NOTCH3 PASS SL005209:NOTCH3 0.17949223 2.735815
## SL007033:LTBP4 LTBP4 PASS SL007033:LTBP4 0.24517197 3.007896
## SL010381:DKKL1 DKKL1 PASS SL010381:DKKL1 -0.08913164 3.519612
## SL002823:SELL SELL PASS SL002823:SELL -0.10073833 3.747984
## SL006230:LUM LUM PASS SL006230:LUM 0.13954849 2.987868
## SL000462:IGFBP1 IGFBP1 PASS SL000462:IGFBP1 0.64242713 3.256480
## SL002505:NPPA NPPA PASS SL002505:NPPA 0.44974714 2.833070
## SL000087:IL6 IL6 PASS SL000087:IL6 0.34505690 2.361511
## t P.Value adj.P.Val Comparison
## SL002785:NPPB 3.255048 1.309033e-03 1.067680e-01 HF-Control
## SL006119:TFF3 2.104139 3.647995e-02 4.760633e-01 HF-Control
## SL009324:FSTL3 4.054383 6.935081e-05 1.131285e-02 HF-Control
## SL006527:EFEMP1 3.062156 2.465906e-03 1.532385e-01 HF-Control
## SL003869:GDF15 2.436193 1.562264e-02 3.376289e-01 HF-Control
## SL003320:FIGF 5.021685 1.044086e-06 4.137462e-04 HF-Control
## SL000052:TNNT2 2.954404 3.466729e-03 1.774969e-01 HF-Control
## SL001996:ANGPT2 4.393125 1.723218e-05 3.212570e-03 HF-Control
## SL007056:BMP10 2.505386 1.294226e-02 3.129725e-01 HF-Control
## SL000306:NPPB 3.601073 3.901509e-04 4.242891e-02 HF-Control
## SL006610:ADAMTS13 -2.521217 1.238986e-02 3.109378e-01 HF-Control
## SL007206:THBS2 2.397205 1.734025e-02 3.428641e-01 HF-Control
## SL004646:LAYN 2.440980 1.542246e-02 3.376289e-01 HF-Control
## SL004258:ADIPOQ 3.971442 9.625537e-05 1.395703e-02 HF-Control
## SL005115:SPON1 4.980009 1.268187e-06 4.137462e-04 HF-Control
## SL009400:CHRDL1 2.704883 7.357245e-03 2.671586e-01 HF-Control
## SL011888:SMOC1 2.470534 1.423635e-02 3.259375e-01 HF-Control
## SL004060:ECE1 -2.197500 2.900799e-02 4.301753e-01 HF-Control
## SL003327:CFD 2.153484 3.234718e-02 4.638799e-01 HF-Control
## SL008382:CST5 3.164242 1.770295e-03 1.274557e-01 HF-Control
## SL003324:F10 -1.994254 4.733506e-02 5.234936e-01 HF-Control
## SL003770:SFRP1 2.534467 1.194377e-02 3.109378e-01 HF-Control
## SL005230:UNC5C 2.013657 4.524007e-02 5.088846e-01 HF-Control
## SL005789:STC1 2.115893 3.545652e-02 4.760633e-01 HF-Control
## SL000592:TIMP2 2.613157 9.577879e-03 2.680377e-01 HF-Control
## SL007696:CD93 5.072210 8.235112e-07 4.137462e-04 HF-Control
## SL001761:TNNI3 2.848276 4.804179e-03 2.089818e-01 HF-Control
## SL006924:AGRP 2.035826 4.294332e-02 5.075789e-01 HF-Control
## SL005209:NOTCH3 2.497738 1.321691e-02 3.129725e-01 HF-Control
## SL007033:LTBP4 5.977009 8.858637e-09 1.156052e-05 HF-Control
## SL010381:DKKL1 -2.029754 4.356232e-02 5.075789e-01 HF-Control
## SL002823:SELL -2.247749 2.556486e-02 4.044525e-01 HF-Control
## SL006230:LUM 3.153239 1.835411e-03 1.274557e-01 HF-Control
## SL000462:IGFBP1 3.016924 2.848259e-03 1.634302e-01 HF-Control
## SL002505:NPPA 3.449183 6.715464e-04 6.259772e-02 HF-Control
## SL000087:IL6 3.693566 2.778209e-04 3.295966e-02 HF-Control
We took the full set of the protein data and we looked for candidates that are significantly correlated (Spearman correlation) with the left ventricular ejection fraction (LVEF). We found 147 proteins that are significantly associated with LVEF, 96 of which are also differentially expressed (Fisher test for enrichment P-value < 2.2e-16). The analysis also shows that 27 proteins LVEF correlated proteins with different variances.
# Spearman correlation of LVEF and protein expresison profiles
cors <- LVEFcorr(Expr = data_CDCS$Expr,
Design = data_CDCS$Design,
Annotation = data_CDCS$Annotation)
head(cors)
## SomaId TargetFullName Target
## SL000591:TIMP1 SL000591 Metalloproteinase_inhibitor_1 TIMP-1
## SL007696:CD93 SL007696 Complement_component_C1q_receptor C1QR1
## SL000403:COL18A1 SL000403 Endostatin Endostatin
## SL005789:STC1 SL005789 Stanniocalcin-1 Stanniocalcin-1
## SL004827:S100A6 SL004827 Protein_S100-A6 S100A6
## SL007547:HAVCR2 SL007547 Hepatitis_A_virus_cellular_receptor_2 TIMD3
## UniProt EntrezGeneID EntrezGeneSymbol Flag Spearman_rho
## SL000591:TIMP1 P01033 7076 TIMP1 FLAG -0.232591644241594
## SL007696:CD93 Q9NPY3 22918 CD93 PASS -0.231770111493184
## SL000403:COL18A1 P39060 80781 COL18A1 PASS -0.230708662101433
## SL005789:STC1 P52823 6781 STC1 PASS -0.23037911581205
## SL004827:S100A6 P06703 6277 S100A6 PASS 0.22871792906718
## SL007547:HAVCR2 Q8TDQ0 84868 HAVCR2 PASS -0.227581824473883
## P.Value adj.P.Val
## SL000591:TIMP1 4.47156451511159e-06 0.000201220403180022
## SL007696:CD93 4.842365537641e-06 0.000210642900887383
## SL000403:COL18A1 5.36503194296819e-06 0.000225844947925728
## SL005789:STC1 5.53796040890675e-06 0.000225844947925728
## SL004827:S100A6 6.49368561021418e-06 0.000256795749131197
## SL007547:HAVCR2 7.23571797494963e-06 0.000277723881097331
# store the data
# write.table(cors, "ProteomicsData/Cors_CDCS.txt", sep = "\t")
# number of proteins that are significantly correlated with LVEF
LVEFproteins_in_cdcs <- as.character(rownames(cors[which(as.numeric(as.character(cors$adj.P.Val)) <= 0.05 &
cors$Flag == "PASS"), ]))
length(LVEFproteins_in_cdcs)
## [1] 147
# number of differentially expressed protein that are significantly correlated with LVEF
length(intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs))
## [1] 96
# enrichment of differentially expressed proteins that are significantly correlated with LVEF
mat <- matrix(c(length(intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs)),
length(intersect(LVEFproteins_in_cdcs, setdiff(rownames(DEs_CDCS),
DEproteins_in_cdcs))),
length(intersect(setdiff(rownames(cors), LVEFproteins_in_cdcs),
DEproteins_in_cdcs)),
length(intersect(setdiff(rownames(cors), LVEFproteins_in_cdcs),
setdiff(rownames(DEs_CDCS), DEproteins_in_cdcs)))),
nrow = 2)
colnames(mat) <- c("Corr_with_LVEF", "Uncorr_with_LVEF")
rownames(mat) <- c("DE", "non-DE")
mat
## Corr_with_LVEF Uncorr_with_LVEF
## DE 96 116
## non-DE 51 1042
fisher.test(mat)
##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 11.24168 25.49497
## sample estimates:
## odds ratio
## 16.84028
# number of proteins with significantly correlation with LVEF and different variances across the two groups
length(intersect(LVEFproteins_in_cdcs,DVproteins_in_cdcs))
## [1] 27
We merged the results of the differential expression and LVEF correlation analysis in order to plot the most important proteins of our study so far:
# merge the differential expression and LVEF corrlelation
sl1 <- sort.list(rownames(DEs_CDCS))
sl2 <- sort.list(rownames(cors))
merged <- cbind(DEs_CDCS[sl1,1:12], cors[sl2, 8:10])
colnames(merged)[colnames(merged) == "P.Value"] <- paste(c("DE_", "LVEF_"), "P.Value", sep = "")
colnames(merged)[colnames(merged) == "adj.P.Val"] <- paste(c("DE_", "LVEF_"), "adj.P.Val", sep = "")
# data to plot
data <- LVEFdata(Expr = data_CDCS$Expr,
Design = data_CDCS$Design,
Annotation = data_CDCS$Annotation,
merged_table = merged)
# the heatmap
LVEFheat(data = data)
# the smoothed expression of the protein clusters
LVEFsmooth(data = data)
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
The analysis identified 5 protein clusters with statistically significant correlation with LVEF (FDR < 5%). PGD protein is the most highly LVEF-correlated (Spearman’s rho = 0.276 and FDR = 6.67E-6). On the other hand, NT-proBNP is the most highly anti-correlated protein of the CDCS dataset (Spearman’s rho = -0.451 and FDR = 2.15E-17).
We utilised the Weighted Gene Co-expression Network Analysis (WGCNA) pipeline to study the HF AND Control specific co-regulation networks of our proteins in CDCS. First we identify the power parameter that approximates scale-free topology for each network:
# first step of WGCNA analysis
wgcnaData <- WGCNAanalysis(Expr = data_CDCS$Expr,
Design = data_CDCS$Design,
Annotation = data_CDCS$Annotation,
DEs = DEs_CDCS,
de_cut = 0.05,
ppHF = 7,
ppC = 7)
## pickSoftThreshold: will use block size 1128.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 1128 of 1128
## Warning: executing %dopar% sequentially: no parallel backend registered
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.713 -1.160 0.883 154.000 1.38e+02 331.00
## 2 2 0.969 -1.280 0.961 41.000 2.67e+01 162.00
## 3 3 0.972 -1.200 0.978 16.500 6.72e+00 101.00
## 4 4 0.950 -1.120 0.972 8.680 2.11e+00 72.10
## 5 5 0.947 -1.060 0.972 5.390 7.84e-01 56.10
## 6 6 0.932 -1.020 0.957 3.700 3.21e-01 45.50
## 7 7 0.943 -0.993 0.967 2.710 1.35e-01 38.10
## 8 8 0.949 -0.972 0.964 2.080 6.19e-02 32.80
## 9 9 0.945 -0.973 0.948 1.640 2.88e-02 28.70
## 10 10 0.935 -0.985 0.925 1.320 1.44e-02 25.40
## 11 12 0.939 -0.981 0.926 0.904 3.75e-03 20.30
## 12 14 0.942 -1.020 0.930 0.649 9.57e-04 16.60
## 13 16 0.931 -1.050 0.913 0.482 2.68e-04 13.80
## 14 18 0.943 -1.050 0.928 0.368 7.86e-05 11.60
## 15 20 0.951 -1.060 0.938 0.287 2.30e-05 9.94
## pickSoftThreshold: will use block size 1128.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 1128 of 1128
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.628 -1.070 0.833 150.000 1.36e+02 328.0
## 2 2 0.920 -1.260 0.900 40.300 2.80e+01 162.0
## 3 3 0.894 -1.230 0.888 16.300 7.32e+00 104.0
## 4 4 0.895 -1.160 0.930 8.600 2.29e+00 75.5
## 5 5 0.892 -1.080 0.961 5.350 8.38e-01 59.1
## 6 6 0.895 -1.020 0.958 3.700 3.52e-01 48.7
## 7 7 0.895 -0.990 0.938 2.740 1.56e-01 41.5
## 8 8 0.908 -0.952 0.950 2.130 7.02e-02 36.0
## 9 9 0.929 -0.934 0.970 1.700 3.39e-02 31.6
## 10 10 0.924 -0.931 0.938 1.400 1.63e-02 28.1
## 11 12 0.945 -0.910 0.951 0.987 3.75e-03 22.7
## 12 14 0.921 -0.941 0.907 0.728 9.66e-04 18.8
## 13 16 0.959 -0.935 0.950 0.554 2.76e-04 15.8
## 14 18 0.935 -0.944 0.918 0.431 7.57e-05 13.5
## 15 20 0.924 -0.971 0.902 0.342 2.10e-05 11.6
The pipeline estimated the optimal power parameter = 7. Using it, we can start building the two co-regulation networks and their associated modules:
# generate the proetin modules
tf <- scan("ProteomicsData/TFs.txt", what = "character")
ef <- scan("ProteomicsData/Epigenetic_factors.txt", what = "characdter")
wgcnaModules <- WGCNAmodules(wgcnaData = wgcnaData, TFs = tf, EFs = ef)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..done.
## ..done.
## mergeCloseModules: Merging modules whose distance is less than 0.2
## multiSetMEs: Calculating module MEs.
## Working on set 1 ...
## moduleEigengenes: Calculating 3 module eigengenes in given set.
## Calculating new MEs...
## multiSetMEs: Calculating module MEs.
## Working on set 1 ...
## moduleEigengenes: Calculating 3 module eigengenes in given set.
## mergeCloseModules: Merging modules whose distance is less than 0.2
## multiSetMEs: Calculating module MEs.
## Working on set 1 ...
## moduleEigengenes: Calculating 2 module eigengenes in given set.
## Calculating new MEs...
## multiSetMEs: Calculating module MEs.
## Working on set 1 ...
## moduleEigengenes: Calculating 2 module eigengenes in given set.
# HF protein modules
head(wgcnaModules$module_data_HF)
## SomaId TargetFullName
## SL002508:IL18BP SL002508 Interleukin-18-binding_protein
## SL004016:CXCL16 SL004016 C-X-C_motif_chemokine_16
## SL009412:DKK3 SL009412 Dickkopf-related_protein_3
## SL008810:NEGR1 SL008810 Neuronal_growth_regulator_1
## SL005196:LSAMP SL005196 Limbic_system-associated_membrane_protein
## SL009349:FSTL1 SL009349 Follistatin-related_protein_1
## Target UniProt EntrezGeneID EntrezGeneSymbol Flag isTF
## SL002508:IL18BP IL-18_BPa O95998 10068 IL18BP PASS 0
## SL004016:CXCL16 CXCL16,_soluble Q9H2A7 58191 CXCL16 PASS 0
## SL009412:DKK3 DKK3 Q9UBP4 27122 DKK3 PASS 0
## SL008810:NEGR1 NEGR1 Q7Z3B1 257194 NEGR1 PASS 0
## SL005196:LSAMP LSAMP Q13449 4045 LSAMP PASS 0
## SL009349:FSTL1 FSTL1 Q12841 11167 FSTL1 PASS 0
## is.Epig Module Cor CorSig
## SL002508:IL18BP 0 #581845_cor.Gene -0.53032037 1.617009e-14
## SL004016:CXCL16 0 #581845_cor.Gene -0.52258757 4.494488e-14
## SL009412:DKK3 0 #581845_cor.Gene -0.50958958 2.366226e-13
## SL008810:NEGR1 0 #581845_cor.Gene -0.47893843 9.099319e-12
## SL005196:LSAMP 0 #581845_cor.Gene -0.47619299 1.240037e-11
## SL009349:FSTL1 0 #581845_cor.Gene -0.46504952 4.235966e-11
## Connectivity TargetFullName.1 EntrezGeneID.1 ModuleName
## SL002508:IL18BP 0.0824199544642018 --- --- ModAnnot_HF1
## SL004016:CXCL16 0.057539649687923 --- --- ModAnnot_HF1
## SL009412:DKK3 0.0464891842044999 --- --- ModAnnot_HF1
## SL008810:NEGR1 0.07045264757201 --- --- ModAnnot_HF1
## SL005196:LSAMP 0.0521577814756525 --- --- ModAnnot_HF1
## SL009349:FSTL1 0.0572776396950279 --- --- ModAnnot_HF1
# Control protein modules
head(wgcnaModules$module_data_Control)
## SomaId TargetFullName
## SL008486:LGALS9 SL008486 Galectin-9
## SL012517:RSPO4 SL012517 R-spondin-4
## SL004248:GDNF SL004248 Glial_cell_line-derived_neurotrophic_factor
## SL011448:TNFRSF4 SL011448 Tumor_necrosis_factor_receptor_superfamily_member_4
## SL001802:IFNG SL001802 Interferon_gamma
## SL002517:TNF SL002517 Tumor_necrosis_factor
## Target UniProt EntrezGeneID EntrezGeneSymbol Flag isTF is.Epig
## SL008486:LGALS9 LEG9 O00182 3965 LGALS9 PASS 0 0
## SL012517:RSPO4 RSPO4 Q2I0M5 343637 RSPO4 PASS 0 0
## SL004248:GDNF GDNF P39905 2668 GDNF PASS 0 0
## SL011448:TNFRSF4 TNR4 P43489 7293 TNFRSF4 PASS 0 0
## SL001802:IFNG IFN-g P01579 3458 IFNG PASS 0 0
## SL002517:TNF TNF-a P01375 7124 TNF PASS 0 0
## Module Cor CorSig Connectivity
## SL008486:LGALS9 #A2E4ED_cor.Gene 0.7872894322 5.309984e-54 0.0482001294030122
## SL012517:RSPO4 #A2E4ED_cor.Gene 0.7589565065 4.113900e-48 0.0459888884934919
## SL004248:GDNF #A2E4ED_cor.Gene 0.7446536119 1.936266e-45 0.101232304453775
## SL011448:TNFRSF4 #A2E4ED_cor.Gene 0.7370502018 4.328097e-44 0.0517663046295784
## SL001802:IFNG #A2E4ED_cor.Gene 0.7212033993 2.011265e-41 0.037901995788316
## SL002517:TNF #A2E4ED_cor.Gene 0.7158598012 1.450797e-40 0.0588539951482277
## TargetFullName.1 EntrezGeneID.1 ModuleName
## SL008486:LGALS9 --- --- ModAnnot_Control1
## SL012517:RSPO4 --- --- ModAnnot_Control1
## SL004248:GDNF --- --- ModAnnot_Control1
## SL011448:TNFRSF4 --- --- ModAnnot_Control1
## SL001802:IFNG --- --- ModAnnot_Control1
## SL002517:TNF --- --- ModAnnot_Control1
We identified 3 HF and 2 Control modules:
# HF modules
table(wgcnaModules$module_data_HF$ModuleName)
##
## ModAnnot_HF1 ModAnnot_HF2 ModAnnot_HF3
## 153 313 662
# Control modules
table(wgcnaModules$module_data_Control$ModuleName)
##
## ModAnnot_Control1 ModAnnot_Control2
## 824 304
# merge and save
mm <- match(rownames(wgcnaModules$module_data_HF), rownames(wgcnaModules$module_data_Control))
wgcnaModules$module_data_Control <- wgcnaModules$module_data_Control[mm, ]
merged_modules <- cbind(wgcnaModules$module_data_HF[,1:9],
Corr_HF = wgcnaModules$module_data_HF[,11],
Corr_Control = wgcnaModules$module_data_Control[,11],
CorrSig_HF = wgcnaModules$module_data_HF[,12],
CorrSig_Control = wgcnaModules$module_data_Control[,12],
Connectivity_HF = wgcnaModules$module_data_HF[,13],
Connectivity_Control = wgcnaModules$module_data_Control[,13],
ModuleName_HF = wgcnaModules$module_data_HF[,16],
ModuleName_Control = wgcnaModules$module_data_Control[,16])
head(merged_modules)
## SomaId TargetFullName
## SL002508:IL18BP SL002508 Interleukin-18-binding_protein
## SL004016:CXCL16 SL004016 C-X-C_motif_chemokine_16
## SL009412:DKK3 SL009412 Dickkopf-related_protein_3
## SL008810:NEGR1 SL008810 Neuronal_growth_regulator_1
## SL005196:LSAMP SL005196 Limbic_system-associated_membrane_protein
## SL009349:FSTL1 SL009349 Follistatin-related_protein_1
## Target UniProt EntrezGeneID EntrezGeneSymbol Flag isTF
## SL002508:IL18BP IL-18_BPa O95998 10068 IL18BP PASS 0
## SL004016:CXCL16 CXCL16,_soluble Q9H2A7 58191 CXCL16 PASS 0
## SL009412:DKK3 DKK3 Q9UBP4 27122 DKK3 PASS 0
## SL008810:NEGR1 NEGR1 Q7Z3B1 257194 NEGR1 PASS 0
## SL005196:LSAMP LSAMP Q13449 4045 LSAMP PASS 0
## SL009349:FSTL1 FSTL1 Q12841 11167 FSTL1 PASS 0
## is.Epig Corr_HF Corr_Control CorrSig_HF CorrSig_Control
## SL002508:IL18BP 0 -0.53032037 0.4307792680 1.617009e-14 1.019809e-12
## SL004016:CXCL16 0 -0.52258757 0.2245546505 4.494488e-14 3.455239e-04
## SL009412:DKK3 0 -0.50958958 0.2195649562 2.366226e-13 4.705256e-04
## SL008810:NEGR1 0 -0.47893843 0.3431590868 9.099319e-12 2.567341e-08
## SL005196:LSAMP 0 -0.47619299 0.2652653578 1.240037e-11 2.141807e-05
## SL009349:FSTL1 0 -0.46504952 0.2911335354 4.235966e-11 2.841307e-06
## Connectivity_HF Connectivity_Control ModuleName_HF
## SL002508:IL18BP 0.0824199544642018 0.0490480657575382 ModAnnot_HF1
## SL004016:CXCL16 0.057539649687923 0.0294019307130613 ModAnnot_HF1
## SL009412:DKK3 0.0464891842044999 0.0330780010007111 ModAnnot_HF1
## SL008810:NEGR1 0.07045264757201 0.0551180917312804 ModAnnot_HF1
## SL005196:LSAMP 0.0521577814756525 0.0416699951520157 ModAnnot_HF1
## SL009349:FSTL1 0.0572776396950279 0.044316466702459 ModAnnot_HF1
## ModuleName_Control
## SL002508:IL18BP ModAnnot_Control1
## SL004016:CXCL16 ModAnnot_Control1
## SL009412:DKK3 ModAnnot_Control1
## SL008810:NEGR1 ModAnnot_Control1
## SL005196:LSAMP ModAnnot_Control1
## SL009349:FSTL1 ModAnnot_Control1
# write.table(merged_modules, "ProteomicsData/Modules.txt", sep = "\t")
Further analysis showed that module Control 1 is split into HF1 (149 proteins; ~18.1%) and HF3 (646 proteins; ~78.4%). Module Control 2 is very similar to HF2 (284 proteins; ~93.4%):
# the proteins of each module
c1 <- rownames(wgcnaModules$module_data_Control)[wgcnaModules$module_data_Control$ModuleName == "ModAnnot_Control1"]
c2 <- rownames(wgcnaModules$module_data_Control)[wgcnaModules$module_data_Control$ModuleName == "ModAnnot_Control2"]
hf1 <- rownames(wgcnaModules$module_data_HF)[wgcnaModules$module_data_HF$ModuleName == "ModAnnot_HF1"]
hf2 <- rownames(wgcnaModules$module_data_HF)[wgcnaModules$module_data_HF$ModuleName == "ModAnnot_HF2"]
hf3 <- rownames(wgcnaModules$module_data_HF)[wgcnaModules$module_data_HF$ModuleName == "ModAnnot_HF3"]
# intersection of Control 1 with all HF modules
length(intersect(c1, hf1)) / length(c1)
## [1] 0.1808252
length(intersect(c1, hf2)) / length(c1)
## [1] 0.03519417
length(intersect(c1, hf3)) / length(c1)
## [1] 0.7839806
# intersection of Control 2 with all HF modules
length(intersect(c2, hf1)) / length(c2)
## [1] 0.01315789
length(intersect(c2, hf2)) / length(c2)
## [1] 0.9342105
length(intersect(c2, hf3)) / length(c2)
## [1] 0.05263158
We applied the WGCNA pipeline of Fuller et al. Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007;18:463-72 to identify protein groups with statistically significant network connectivity and differential expression between HF and Control. Our strategy is as follows:
Record the HF and Controlgroup connectivities, \(K_{HF}\) and \(K_{Control}\), the connectivity difference \(DiffK = K_{HF} - K_{Control}\) and the limma t-test statistics and FDRs of the 1128 QC-passed proteins of CDCS.
Permute the HF and Control labels in order to generate new random patient groups.
Run the WGCNA analysis to estimate the permuted group connectivities \(K^*_{HF}\) and \(K^*_{Control}\) and calculate the connectivity difference \(DiffK^* = K^*_{HF} - K^*_{Control}\).
Iterate B = 50,000 times.
Evaluate the connectivity \(DiffK\) significance by the resampling p-value: \(P = #(DiffK^* > DiffK) / B\).
# collect the diffk = difference in connectivity (HF - Control)
diffk <- cbind(colnames(wgcnaData$wgcnaData_HF$Data), wgcnaData$DiffK)
# calculate the permutation statistics B = 50,000 times
stats_star <- wgcnaStatsFromPermutation(wgcnaData = wgcnaData, B = 50000)
# evaluate the differential co-expression networks
dina <- summariseDiNA(wgcnaData = wgcnaData,
wgcnaModules = wgcnaModules,
DiffK = diffk,
DEs = DEs_CDCS,
de.fdr.cut = 0.05,
diffk.cut = 0.1,
diffk.fdr.cut = 0.05,
permStats = stats_star[[1]])
# store the results
# write.table(dina, "ProteomicsData/DiNA.txt", sep = "\t")
This approach extracts the differentially expressed proteins and the proteins that show differential connectivity between HF and Control at FDR = 5%. The significant proteins are the following:
# load the pre-calculated data (several hours of computing)
dina <- read.table("ProteomicsData/DiNA.txt", sep = "\t", comment.char = "!")
# the significant proteins
dina_signf <- dina[dina$DiNA_significance == "Y", ]
dina_signf
## SomaId
## SL004557:CD59 SL004557
## SL008631:DSC2 SL008631
## SL001777:CST3 SL001777
## SL004140:EFNA4 SL004140
## SL004141:EFNA5 SL004141
## SL001992:TNFRSF1A SL001992
## SL005172:IGFBP6 SL005172
## SL005193:JAM2 SL005193
## SL005194:JAM3 SL005194
## SL001800:TNFRSF1B SL001800
## SL004151:IL15RA SL004151
## SL000283:B2M SL000283
## SL006119:TFF3 SL006119
## SL002654:EPHA2 SL002654
## SL005213:RELT SL005213
## TargetFullName
## SL004557:CD59 CD59_glycoprotein
## SL008631:DSC2 Desmocollin-2
## SL001777:CST3 Cystatin-C
## SL004140:EFNA4 Ephrin-A4
## SL004141:EFNA5 Ephrin-A5
## SL001992:TNFRSF1A Tumor_necrosis_factor_receptor_superfamily_member_1A
## SL005172:IGFBP6 Insulin-like_growth_factor-binding_protein_6
## SL005193:JAM2 Junctional_adhesion_molecule_B
## SL005194:JAM3 Junctional_adhesion_molecule_C
## SL001800:TNFRSF1B Tumor_necrosis_factor_receptor_superfamily_member_1B
## SL004151:IL15RA Interleukin-15_receptor_subunit_alpha
## SL000283:B2M Beta-2-microglobulin
## SL006119:TFF3 Trefoil_factor_3
## SL002654:EPHA2 Ephrin_type-A_receptor_2
## SL005213:RELT Tumor_necrosis_factor_receptor_superfamily_member_19L
## Target UniProt EntrezGeneID EntrezGeneSymbol
## SL004557:CD59 CD59 P13987 966 CD59
## SL008631:DSC2 DSC2 Q02487 1824 DSC2
## SL001777:CST3 Cystatin_C P01034 1471 CST3
## SL004140:EFNA4 Ephrin-A4 P52798 1945 EFNA4
## SL004141:EFNA5 Ephrin-A5 P52803 1946 EFNA5
## SL001992:TNFRSF1A TNF_sR-I P19438 7132 TNFRSF1A
## SL005172:IGFBP6 IGFBP-6 P24592 3489 IGFBP6
## SL005193:JAM2 JAM-B P57087 58494 JAM2
## SL005194:JAM3 JAM-C Q9BX67 83700 JAM3
## SL001800:TNFRSF1B TNF_sR-II P20333 7133 TNFRSF1B
## SL004151:IL15RA IL-15_Ra Q13261 3601 IL15RA
## SL000283:B2M b2-Microglobulin P61769 567 B2M
## SL006119:TFF3 TFF3 Q07654 7033 TFF3
## SL002654:EPHA2 Epithelial_cell_kinase P29317 1969 EPHA2
## SL005213:RELT RELT Q969Z4 84957 RELT
## Flag isTF is.Epig Cor CorSig ModuleName
## SL004557:CD59 PASS 0 0 -0.37699986 1.684187e-07 ModAnnot_HF1
## SL008631:DSC2 PASS 0 0 -0.31974845 1.145474e-05 ModAnnot_HF1
## SL001777:CST3 PASS 0 0 -0.12333575 9.809936e-02 ModAnnot_HF1
## SL004140:EFNA4 PASS 0 0 -0.36561935 4.167091e-07 ModAnnot_HF1
## SL004141:EFNA5 PASS 0 0 -0.33355510 4.463717e-06 ModAnnot_HF1
## SL001992:TNFRSF1A PASS 0 0 -0.24723616 7.924368e-04 ModAnnot_HF1
## SL005172:IGFBP6 PASS 0 0 -0.14328861 5.431233e-02 ModAnnot_HF1
## SL005193:JAM2 PASS 0 0 -0.17824612 1.636377e-02 ModAnnot_HF1
## SL005194:JAM3 PASS 0 0 0.08706781 2.438260e-01 ModAnnot_HF2
## SL001800:TNFRSF1B PASS 0 0 -0.30113489 3.799977e-05 ModAnnot_HF1
## SL004151:IL15RA PASS 0 0 -0.32955902 5.891338e-06 ModAnnot_HF1
## SL000283:B2M PASS 0 0 -0.11780828 1.142277e-01 ModAnnot_HF1
## SL006119:TFF3 PASS 0 0 -0.30672398 2.673419e-05 ModAnnot_HF1
## SL002654:EPHA2 PASS 0 0 -0.33424636 4.252822e-06 ModAnnot_HF1
## SL005213:RELT PASS 0 0 -0.35547883 9.075413e-07 ModAnnot_HF1
## logFC Ttest DE_PV DE_FDR DiffK
## SL004557:CD59 0.1351539 0.1351539 6.431785e-05 1.199069e-03 0.1739577
## SL008631:DSC2 0.2210540 0.2210540 1.178876e-07 5.917050e-06 0.1878401
## SL001777:CST3 0.2552607 0.2552607 1.443106e-11 9.416268e-09 0.1541372
## SL004140:EFNA4 0.2116332 0.2116332 2.088071e-08 1.362466e-06 0.1959876
## SL004141:EFNA5 0.2069690 0.2069690 6.266176e-07 2.336389e-05 0.1453081
## SL001992:TNFRSF1A 0.2325912 0.2325912 1.802758e-07 8.112410e-06 0.1989524
## SL005172:IGFBP6 0.1332245 0.1332245 2.818557e-04 3.678217e-03 0.1192140
## SL005193:JAM2 0.1420750 0.1420750 8.817774e-07 3.110053e-05 0.1522122
## SL005194:JAM3 0.1514172 0.1514172 5.155150e-04 5.701246e-03 -0.3265702
## SL001800:TNFRSF1B 0.1388130 0.1388130 6.963250e-04 6.781374e-03 0.1440474
## SL004151:IL15RA 0.1743670 0.1743670 7.361769e-05 1.334321e-03 0.1196379
## SL000283:B2M 0.2929166 0.2929166 3.352110e-09 4.910032e-07 0.1544569
## SL006119:TFF3 0.3352670 0.3352670 4.552645e-11 1.980400e-08 0.1637671
## SL002654:EPHA2 0.2535444 0.2535444 1.733899e-08 1.257077e-06 0.1724334
## SL005213:RELT 0.2299626 0.2299626 2.279386e-07 9.595480e-06 0.2201945
## DiffK_PV DiffK_FDR Comparison DiNA_significance
## SL004557:CD59 0.00014 0.01611429 HF-Control Y
## SL008631:DSC2 0.00018 0.01611429 HF-Control Y
## SL001777:CST3 0.00038 0.01863652 HF-Control Y
## SL004140:EFNA4 0.00056 0.02423111 HF-Control Y
## SL004141:EFNA5 0.00032 0.01863652 HF-Control Y
## SL001992:TNFRSF1A 0.00020 0.01611429 HF-Control Y
## SL005172:IGFBP6 0.00066 0.02567172 HF-Control Y
## SL005193:JAM2 0.00000 0.00000000 HF-Control Y
## SL005194:JAM3 0.00120 0.03760000 HF-Control Y
## SL001800:TNFRSF1B 0.00016 0.01611429 HF-Control Y
## SL004151:IL15RA 0.00038 0.01863652 HF-Control Y
## SL000283:B2M 0.00026 0.01833000 HF-Control Y
## SL006119:TFF3 0.00016 0.01611429 HF-Control Y
## SL002654:EPHA2 0.00010 0.01611429 HF-Control Y
## SL005213:RELT 0.00002 0.01128000 HF-Control Y
# number of significant proteins by module
tab <- table(dina$ModuleName, dina$DiNA_significance)
tab
##
## N Y
## ModAnnot_HF1 139 14
## ModAnnot_HF2 312 1
## ModAnnot_HF3 662 0
# enrichment of HF1
mat <- cbind(tab[1, ], apply(tab[2:3, ], 2, sum))
fisher.test(mat)
##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 5.554e-12
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.0002469349 0.0684075100
## sample estimates:
## odds ratio
## 0.01026009
The analysis highlighted 15 proteins that are both differentially expressed and differentially connected (DiNA). All but one of them come from the HF1 module. The HF1 module is statistically over-represented with significant proteins (Fisher test P = 5.554e-12). The plot below shows these proteins in blue across the logFC and DiffK axes:
plotDiNA(DiNAdata = dina)
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
In this paragraph, we will briefly show the integration of the proteomic and the single-cell RNA-seq results. We will use the processed data of 3 single-cell and 1 single-nuclei RNA-seq experiments. The processing has been described in our paper. The raw data can be downloaded from the respective publications.
First, we load the normalised expression profiles of the 440 Sham and 390 MI high-quality cardiac fibroblasts from 3 Sham and 3 MI mice (nonMyo_expression), the design of the experiment (nonMyo_design) and the differentially expressed genes of the Sham vs MI comparison estimated by Seurat (nonMyo_de). We also consider the differentially expressed proteins and the LVEF correlations of the CDCS data analysis.
# load the RNA-seq data
load("nonMyo.RData")
# contents of the .Rda file
names(nonMyo)
## [1] "nonMyo_expression" "nonMyo_design" "nonMyo_de"
# stored differentially expressed proteins
p_DE <- read.table("ProteomicsData/cands_CDCSonly.txt", sep = "\t", header = T)
# the correlations are read from the 'cors' variable above
The function below intersects the differentially expressed proteins and all genes from the non-myocyte sincle-cell RNA-seq experiment and plots the latter expression profiles in a heatmap. Gene names in green are differentially expressed in both datasets (FDR<10% and |logFC| >1 in the single-cell experiment), gene names in khaki are differentially expressed in the single-cell dataset only with the same logFC signs in the two datasets and gene names in gray have different logFC signs in the two datasets independently of differential expression. The right-hand side plots the somalogic logFCs (HF - Control) highlighting in cyan the differentially expressed proteins at FDR < 0.1%.
# Integration of plasma proteomics and sc fibroblasts
nonmyo <- scrna_nonmyocytes(scExpr = nonMyo$nonMyo_expression,
scDesign = nonMyo$nonMyo_design,
scDE = nonMyo$nonMyo_de,
pDE = p_DE,
lvefStats = cors)
In the same fashion, we analyse the other 3 available datasets. We load and integrate the TPM normalised expression profiles of mouse cardiomyocytes from 85 Sham and 100 TAC high-quality single nuclei (snCardiomyocytes_expression), the design of the experiment (snCardiomyocytes_design) and the differentially expressed genes of Sham vs TAC by edgeR (snCardiomyocytes_de).
# load the RNA-seq data
load("snMyo.RData")
# contents of the .Rda file
names(snMyo)
## [1] "snCardiomyocytes_expression" "snCardiomyocytes_design"
## [3] "snCardiomyocytes_de"
# Integration of plasma proteomics and sn myocytes
myo1 <- scrna_myocytes(scExpr = snMyo$snCardiomyocytes_expression,
scDesign = snMyo$snCardiomyocytes_design,
scDE = snMyo$snCardiomyocytes_de,
pDE = p_DE,
lvefStats = cors)
We load and integrate the TPM normalised expression profiles the time course mouse myocytes dataset of 88 Sham, 69 day-3 TAC, 83 day-7 TAC and 73 4-weeks TAC. As differentially expressed, we consider those genes with FDR < 10% and |logFC| > 1 in the Sham vs TAC of at least one time point.
# load the RNA-seq data
load("tsMyo.RData")
# contents of the .Rda file
names(tsMyo)
## [1] "tsCardiomyocytes_expression" "tsCardiomyocytes_design"
## [3] "tsCardiomyocytes_de"
# Integration of plasma proteomics and sc myocytes of the time-series experiment
myo2 <- scrna_myocytes_time(scExpr = tsMyo$tsCardiomyocytes_expression,
scDesign = tsMyo$tsCardiomyocytes_design,
scDE = tsMyo$tsCardiomyocytes_de,
pDE = p_DE,
lvefStats = cors)
Finally, we load and integrate the human myocytes dataset of 71 Control and 488 DCM single cells:
# load the RNA-seq data
load("hMyo.RData")
# contents of the .Rda file
names(hMyo)
## [1] "hCardiomyocytes_expression" "hCardiomyocytes_design"
## [3] "hCardiomyocytes_de"
# Integration of plasma proteomics and sc human myocytes
hmyo <- scrna_humanMyocytes(scExpr = hMyo$hCardiomyocytes_expression,
scDesign = hMyo$hCardiomyocytes_design,
scDE = hMyo$hCardiomyocytes_de,
pDE = p_DE,
lvefStats = cors)
In this paragraph we will summarise the main findings of this study. We analysed the data of two protein cohorts and 4 single-cell RNA-seq datasets and integrated the findings in order to prioritise a list of proteins for further experimental validations. In CDCS we found 1128 proteins that passed the QC criteria:
# QC in CDCS
table(data_CDCS$Annotation$Flag)
##
## FLAG PASS
## 177 1128
The differential expression analysis identified 212 significant proteins among them:
# number of differentially expressed proteins in CDCS
length(DEproteins_in_cdcs)
## [1] 212
We intersected these results with those of LVEF correlation analysis to find 96 differentially expressed and LVEF significantly correlated proteins:
# number of differentially expressed and LVEF significantly correlated proteins in CDCS
leg1_lvef <- intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs)
length(leg1_lvef)
## [1] 96
Finally, we run WGCNA analysis to find a robust set of significant proteins that exhibit all the above criteria and have unique co-regulation patterns in HF compared to Control. We ended up with 12 candidates:
# number of differentially expressed, LVEF significantly correlated & HF uniquely co-regulated proteins in CDCS
leg1_DiNA <- intersect(intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs),
rownames(dina_signf)[dina_signf$ModuleName == "ModAnnot_HF1"])
length(leg1_DiNA)
## [1] 12
We integrated the above results with those of the differential expression analysis of four independent single-cell RNA-seq datasets. In the mouse cardiac non-myocytes dataset we found:
Green set: 10 differentially expressed genes whose protein counterparts were also differentially expressed. Only 6 of them passed the CDCS QC.
Khaki set: 28 non-differentially expressed genes with differentially expressed protein counterparts and same logFC direction. The single-cell logFCs are constrained to be > 1 in absolute value. All of them passed the CDCS QC.
Gray set: 8 genes with inconsistent logFC direction compared to the CDCS.
We considered as consisten the green and khaki candidates that passed the CDCS QC. We grouped together the consistent results and found that they are enriched among the 467 differentially expressed genes of the non-myocytes study (Fisher test P = 1.253e-3):
# agreement between CDCS and non-myocyte scRNA-seq differential expression
table(nonmyo[[1]][, 4], nonmyo[[1]][, 5])
##
## FLAG PASS
## gray 1 7
## green 4 6
## khaki 0 28
# frequency table and enrichment of consistent findings
nonmyo[[3]]
## Consistent Inconsistent
## Common 34 12
## Non-common 433 455
fisher.test(nonmyo[[3]])
##
## Fisher's Exact Test for Count Data
##
## data: nonmyo[[3]]
## p-value = 0.001253
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.478916 6.393639
## sample estimates:
## odds ratio
## 2.974047
In the same way, we integrated the results of CDCS and the other single-cell datasets. In the mouse cardiomyocytes of TAC vs Control we found that 13 genes of the Green set passed CDCS’s QC and a highly significant enrichment of consistent hits among the 713 differentially expressed genes (Fisher test P = 6.807e-09).
# agreement between CDCS and myocyte scRNA-seq differential expression
table(myo1[[1]][, 4], myo1[[1]][, 5])
##
## FLAG PASS
## gray 1 3
## green 3 13
## khaki 4 25
# frequency table and enrichment of consistent findings
myo1[[3]]
## Consistent Inconsistent
## Common 38 3
## Non-common 675 710
fisher.test(myo1[[3]])
##
## Fisher's Exact Test for Count Data
##
## data: myo1[[3]]
## p-value = 6.807e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 4.191496 67.579692
## sample estimates:
## odds ratio
## 13.30144
In the mouse cardiomyocytes TAC vs Control time-course experiment we found 10 Green set genes that passed CDCS’s QC and a highly significant enrichment of consistent hits among the 3022 differentially expressed genes of any of the pairwise comparisons (Fisher test P = 1.809e-10). In addition, we found that such a highly significant enrichment is reproduced among the LVEF associated proteins (Fisher test P = 1.42e-5).
# agreement between CDCS and myocyte scRNA-seq differential expression
table(myo2[[1]][, 4], myo2[[1]][, 5])
##
## FLAG PASS
## gray 1 5
## green 4 10
## khaki 5 40
# frequency table and enrichment of consistent findings
myo2[[3]]
## Consistent Inconsistent
## Common 50 5
## Non-common 2972 3017
fisher.test(myo2[[3]])
##
## Fisher's Exact Test for Count Data
##
## data: myo2[[3]]
## p-value = 1.809e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 4.062894 32.665758
## sample estimates:
## odds ratio
## 10.14807
Finally, in the human cardiomyocytes DCM vs Control study we found 11 Green set genes that passed CDCS’s QC and a highly significant enrichment of consistent hits among the 390 differentially expressed genes of the study (Fisher test P = 3.487e-08):
# agreement between CDCS and myocyte scRNA-seq differential expression
table(hmyo[[1]][, 4], hmyo[[1]][, 5])
##
## FLAG PASS
## gray 2 5
## green 0 11
## khaki 4 29
# frequency table and enrichment of consistent findings
hmyo[[3]]
## Consistent Inconsistent
## Common 40 5
## Non-common 350 385
fisher.test(hmyo[[3]])
##
## Fisher's Exact Test for Count Data
##
## data: hmyo[[3]]
## p-value = 3.487e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 3.411695 28.838281
## sample estimates:
## odds ratio
## 8.780463
We collected the results of all-single-cell analysis and found 128 unique candidates (127 if we count NPPB only once with respect to gene names) of which 24 (23 if we count NPPB only once with respect to gene names) were significantly correlated with LVEF.
# green/khaki QC-passed genes found in at least one dataset
set1 <- as.character(nonmyo[[1]][nonmyo[[1]][, 4] == "green" & nonmyo[[1]][, 5] == "PASS" |
nonmyo[[1]][, 4] == "khaki" & nonmyo[[1]][, 5] == "PASS", 1])
set1 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[,6]),
GeneNames(toupper(set1)), nomatch = 0) > 0 ]
set2 <- as.character(myo1[[1]][myo1[[1]][, 4] == "green" & myo1[[1]][, 5] == "PASS" |
myo1[[1]][, 4] == "khaki" & myo1[[1]][, 5] == "PASS", 1])
set2 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]),
GeneNames(toupper(set2)), nomatch = 0) > 0 ]
set3 <- as.character(myo2[[1]][myo2[[1]][, 4] == "green" & myo2[[1]][, 5] == "PASS" |
myo2[[1]][, 4] == "khaki" & myo2[[1]][, 5] == "PASS", 1])
set3 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]),
toupper(set3), nomatch = 0) > 0 ]
set4 <- as.character(hmyo[[1]][hmyo[[1]][, 4] == "green" & hmyo[[1]][, 5] == "PASS" |
hmyo[[1]][, 4] == "khaki" & hmyo[[1]][, 5] == "PASS", 1])
set4 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]),
set4, nomatch = 0) > 0 ]
consistent_set <- unique(c(toupper(set1),
toupper(set2),
toupper(set3),
set4))
length(consistent_set)
## [1] 128
# intersection with LVEF
leg2_consistent_lvef <- intersect(consistent_set, LVEFproteins_in_cdcs)
length(leg2_consistent_lvef)
## [1] 24
Of those, we prioritised for further analysis and experimental validation only the 30 green candidates (differentially expressed in both proteomic and transcriptomic data), of which 15 have protein counterparts that are significantly correlated with LVEF:
# green QC-passed genes found in at least one dataset
set1 <- as.character(nonmyo[[1]][nonmyo[[1]][, 4] == "green" & nonmyo[[1]][, 5] == "PASS", 1])
set1 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[,6]),
GeneNames(toupper(set1)), nomatch = 0) > 0 ]
set2 <- as.character(myo1[[1]][myo1[[1]][, 4] == "green" & myo1[[1]][, 5] == "PASS", 1])
set2 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]),
GeneNames(toupper(set2)), nomatch = 0) > 0 ]
set3 <- as.character(myo2[[1]][myo2[[1]][, 4] == "green" & myo2[[1]][, 5] == "PASS", 1])
set3 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]),
toupper(set3), nomatch = 0) > 0 ]
set4 <- as.character(hmyo[[1]][hmyo[[1]][, 4] == "green" & hmyo[[1]][, 5] == "PASS", 1])
set4 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]),
set4, nomatch = 0) > 0 ]
prioritised_set <- unique(c(toupper(set1),
toupper(set2),
toupper(set3),
set4))
length(prioritised_set)
## [1] 30
# intersection with LVEF
leg2_prioritised <- intersect(prioritised_set, LVEFproteins_in_cdcs)
length(leg2_prioritised)
## [1] 15
To obtain the final list of prioritised proteins for experimental validation, we found common proteins of all pairwise comparisons of three sets:
leg1_prioritised <- intersect(leg1_lvef, rownames(dina)[dina$ModuleName == "ModAnnot_HF1"])
leg1_prioritised
## [1] "SL007696:CD93" "SL000403:COL18A1" "SL007547:HAVCR2"
## [4] "SL005152:RARRES2" "SL018509:RSPO3" "SL006527:EFEMP1"
## [7] "SL010465:MATN2" "SL005357:REG1A" "SL000695:LCN2"
## [10] "SL005430:SECTM1" "SL009400:CHRDL1" "SL004646:LAYN"
## [13] "SL019019:PIANP" "SL004141:EFNA5" "SL001800:TNFRSF1B"
## [16] "SL000466:IGFBP2" "SL000525:MMP7" "SL007306:FAM3B"
## [19] "SL011888:SMOC1" "SL004557:CD59" "SL005230:UNC5C"
## [22] "SL011509:PYY" "SL008486:LGALS9" "SL012561:REG4"
## [25] "SL004151:IL15RA" "SL003190:CCL15" "SL003970:PTH"
## [28] "SL002506:PLAUR" "SL007033:LTBP4" "SL000024:F3"
## [31] "SL007056:BMP10" "SL007274:EFNB2" "SL011448:TNFRSF4"
## [34] "SL004149:IL13RA1" "SL016148:Human-virus" "SL011068:IL17RC"
## [37] "SL009207:DCTN2" "SL006924:AGRP" "SL004458:PI3"
## [40] "SL003915:KLK8" "SL008372:FLRT3" "SL003060:FGFR1"
## [43] "SL000658:GAS1" "SL004140:EFNA4" "SL008099:CAPG"
## [46] "SL006119:TFF3" "SL002785:NPPB" "SL001774:FABP3"
## [49] "SL009324:FSTL3" "SL001777:CST3" "SL007206:THBS2"
## [52] "SL001888:SLPI" "SL000002:VEGFA" "SL003869:GDF15"
## [55] "SL000052:TNNT2" "SL001996:ANGPT2" "SL003320:FIGF"
## [58] "SL002654:EPHA2" "SL001992:TNFRSF1A" "SL000283:B2M"
## [61] "SL005213:RELT" "SL008631:DSC2" "SL003329:CCL14"
leg2_prioritised
## [1] "SL001996:ANGPT2" "SL004149:IL13RA1" "SL007206:THBS2" "SL007033:LTBP4"
## [5] "SL001777:CST3" "SL000283:B2M" "SL000306:NPPB" "SL001774:FABP3"
## [9] "SL002505:NPPA" "SL002785:NPPB" "SL000695:LCN2" "SL007207:THBS4"
## [13] "SL009324:FSTL3" "SL000052:TNNT2" "SL001761:TNNI3"
immaculate_set <- rownames(meta)[meta$Validated == TRUE]
immaculate_set
## [1] "SL002785:NPPB" "SL006119:TFF3" "SL009324:FSTL3"
## [4] "SL006527:EFEMP1" "SL003869:GDF15" "SL003320:FIGF"
## [7] "SL000052:TNNT2" "SL001996:ANGPT2" "SL007056:BMP10"
## [10] "SL000306:NPPB" "SL006610:ADAMTS13" "SL007206:THBS2"
## [13] "SL004646:LAYN" "SL004258:ADIPOQ" "SL005115:SPON1"
## [16] "SL009400:CHRDL1" "SL011888:SMOC1" "SL004060:ECE1"
## [19] "SL003327:CFD" "SL008382:CST5" "SL003324:F10"
## [22] "SL003770:SFRP1" "SL005230:UNC5C" "SL005789:STC1"
## [25] "SL000592:TIMP2" "SL007696:CD93" "SL001761:TNNI3"
## [28] "SL006924:AGRP" "SL005209:NOTCH3" "SL007033:LTBP4"
## [31] "SL010381:DKKL1" "SL002823:SELL" "SL006230:LUM"
## [34] "SL000462:IGFBP1" "SL002505:NPPA" "SL000087:IL6"
The analysis prioritised 25 proteins: 6 common among all sets, 11 common between the CDCS and the IMMACULATE sets, 3 common between the IMMACULATE and the single-cell RNA-seq set and 5 common between the CDCS and the single-cell RNA-seq set. Of them, 7 are differentially expressed in the IMMACULATE cohort at FDR = 10%:
# venn diagram
v <- venn(list(cdcs = leg1_prioritised, scRNAseq = leg2_prioritised, immaculate = immaculate_set))
# proteins of the venn diagram
attributes(v)$intersections
## $immaculate
## [1] "SL006610:ADAMTS13" "SL004258:ADIPOQ" "SL005115:SPON1"
## [4] "SL004060:ECE1" "SL003327:CFD" "SL008382:CST5"
## [7] "SL003324:F10" "SL003770:SFRP1" "SL005789:STC1"
## [10] "SL000592:TIMP2" "SL005209:NOTCH3" "SL010381:DKKL1"
## [13] "SL002823:SELL" "SL006230:LUM" "SL000462:IGFBP1"
## [16] "SL000087:IL6"
##
## $`cdcs:scRNAseq`
## [1] "SL000695:LCN2" "SL004149:IL13RA1" "SL001774:FABP3" "SL001777:CST3"
## [5] "SL000283:B2M"
##
## $`cdcs:immaculate`
## [1] "SL007696:CD93" "SL006527:EFEMP1" "SL009400:CHRDL1" "SL004646:LAYN"
## [5] "SL011888:SMOC1" "SL005230:UNC5C" "SL007056:BMP10" "SL006924:AGRP"
## [9] "SL006119:TFF3" "SL003869:GDF15" "SL003320:FIGF"
##
## $`scRNAseq:immaculate`
## [1] "SL000306:NPPB" "SL002505:NPPA" "SL001761:TNNI3"
##
## $`cdcs:scRNAseq:immaculate`
## [1] "SL007033:LTBP4" "SL002785:NPPB" "SL009324:FSTL3" "SL007206:THBS2"
## [5] "SL000052:TNNT2" "SL001996:ANGPT2"
##
## $cdcs
## [1] "SL000403:COL18A1" "SL007547:HAVCR2" "SL005152:RARRES2"
## [4] "SL018509:RSPO3" "SL010465:MATN2" "SL005357:REG1A"
## [7] "SL005430:SECTM1" "SL019019:PIANP" "SL004141:EFNA5"
## [10] "SL001800:TNFRSF1B" "SL000466:IGFBP2" "SL000525:MMP7"
## [13] "SL007306:FAM3B" "SL004557:CD59" "SL011509:PYY"
## [16] "SL008486:LGALS9" "SL012561:REG4" "SL004151:IL15RA"
## [19] "SL003190:CCL15" "SL003970:PTH" "SL002506:PLAUR"
## [22] "SL000024:F3" "SL007274:EFNB2" "SL011448:TNFRSF4"
## [25] "SL016148:Human-virus" "SL011068:IL17RC" "SL009207:DCTN2"
## [28] "SL004458:PI3" "SL003915:KLK8" "SL008372:FLRT3"
## [31] "SL003060:FGFR1" "SL000658:GAS1" "SL004140:EFNA4"
## [34] "SL008099:CAPG" "SL001888:SLPI" "SL000002:VEGFA"
## [37] "SL002654:EPHA2" "SL001992:TNFRSF1A" "SL005213:RELT"
## [40] "SL008631:DSC2" "SL003329:CCL14"
##
## $scRNAseq
## [1] "SL007207:THBS4"
# the 25 proteins of interest
attributes(v)$intersections[2:5]
## $`cdcs:scRNAseq`
## [1] "SL000695:LCN2" "SL004149:IL13RA1" "SL001774:FABP3" "SL001777:CST3"
## [5] "SL000283:B2M"
##
## $`cdcs:immaculate`
## [1] "SL007696:CD93" "SL006527:EFEMP1" "SL009400:CHRDL1" "SL004646:LAYN"
## [5] "SL011888:SMOC1" "SL005230:UNC5C" "SL007056:BMP10" "SL006924:AGRP"
## [9] "SL006119:TFF3" "SL003869:GDF15" "SL003320:FIGF"
##
## $`scRNAseq:immaculate`
## [1] "SL000306:NPPB" "SL002505:NPPA" "SL001761:TNNI3"
##
## $`cdcs:scRNAseq:immaculate`
## [1] "SL007033:LTBP4" "SL002785:NPPB" "SL009324:FSTL3" "SL007206:THBS2"
## [5] "SL000052:TNNT2" "SL001996:ANGPT2"
# DE in IMMACULATE
intersect(DEproteins_in_immaculate10, unlist(attributes(v)$intersections[2:5]))
## [1] "SL007033:LTBP4" "SL007696:CD93" "SL003320:FIGF" "SL001996:ANGPT2"
## [5] "SL009324:FSTL3" "SL000306:NPPB" "SL002505:NPPA"
We calculated the enrichment of the 6 common proteins of all sets: we generated a 2x2 frequency matrix with the ‘proteins validated in all sets’ (n1), the ‘proteins validated in the two proteomics but not in the RNA-seq’ (n2), the ‘proteins not validated in both proteomics but validated in RNA-seq’ (n3) and the ‘proteins not validated anywhere’ (n4). The results show a significant enrichment with Fisher P-value = 2.25e-4.
# the 6 proteins of interest
n1 <- length(unlist(attributes(v)$intersections[5]))
# only found in proteomics
n2 <- length(unlist(attributes(v)$intersections[c(1, 3, 6)]))
# only found in RNA-seq and one of the two proteomics
n3 <- length(unlist(attributes(v)$intersections[c(2, 4, 7)]))
# found nowhere
n4 <- length(setdiff(rownames(meta)[meta$Validated == FALSE], unlist(attributes(v)$intersections)))
# enrichment
mat <- matrix(c(n1, n2, n3, n4), ncol = 2)
colnames(mat) <- c("Proteomics","Not-proteomics")
rownames(mat) <- c("RNA","Not-RNA")
mat
## Proteomics Not-proteomics
## RNA 6 9
## Not-RNA 68 1024
fisher.test(mat)
##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 0.0002256
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.839098 32.480616
## sample estimates:
## odds ratio
## 9.989364
At the last step of our prioritisation procedure, we examine which of the 25 proteins (found in all sets or at least one of the proteomics and the single-cell RNA-seq) are directly connected to at least one of the DiNA proteins:
# the 25 validated proteins
validated25 <- unlist(attributes(v)$intersections[2:5])
# the DiNA proteins with significant LVEF association
leg1_DiNA
## [1] "SL004141:EFNA5" "SL001800:TNFRSF1B" "SL004557:CD59"
## [4] "SL004151:IL15RA" "SL004140:EFNA4" "SL006119:TFF3"
## [7] "SL001777:CST3" "SL002654:EPHA2" "SL001992:TNFRSF1A"
## [10] "SL000283:B2M" "SL005213:RELT" "SL008631:DSC2"
# the WGCNA edges of HF1
HFedges <- read.table("ProteomicsData/CytoscapeInput_HF_edges_blue.txt", sep = "\t", header = T)
# the final prioritised protein set
finalSet <- integrated_prioritisation(edges = HFedges,
dina = leg1_DiNA,
validated = validated25)
finalSet
## cdcs:scRNAseq4 cdcs:scRNAseq5 cdcs:immaculate9
## "SL001777:CST3" "SL000283:B2M" "SL006119:TFF3"
## cdcs:scRNAseq:immaculate3 cdcs:scRNAseq1 cdcs:immaculate4
## "SL009324:FSTL3" "SL000695:LCN2" "SL004646:LAYN"
## cdcs:immaculate8 cdcs:immaculate1 cdcs:immaculate10
## "SL006924:AGRP" "SL007696:CD93" "SL003869:GDF15"
## cdcs:immaculate11 cdcs:immaculate5 cdcs:scRNAseq2
## "SL003320:FIGF" "SL011888:SMOC1" "SL004149:IL13RA1"
## cdcs:immaculate7 cdcs:immaculate3 cdcs:scRNAseq:immaculate1
## "SL007056:BMP10" "SL009400:CHRDL1" "SL007033:LTBP4"
## cdcs:scRNAseq:immaculate6 cdcs:scRNAseq:immaculate4 cdcs:immaculate6
## "SL001996:ANGPT2" "SL007206:THBS2" "SL005230:UNC5C"
## cdcs:scRNAseq3 cdcs:immaculate2 scRNAseq:immaculate1
## "SL001774:FABP3" "SL006527:EFEMP1" "SL000306:NPPB"
## cdcs:scRNAseq:immaculate5
## "SL000052:TNNT2"
We found 22 proteins which form our final prioritised set. TFF3, CST3 and B2M are among those at the top of the list being cross-cohort validated (differentially expressed in CDCS and validated in IMMACULATE), differentially co-expressed in CDCS and significantly associated to LVEF. This final 22-protein set includes 5 of the 6 proteins were found in all datasets:
# cross-cohort validated, DiNA and LVEF associated
intersect(leg1_DiNA, finalSet)
## [1] "SL006119:TFF3" "SL001777:CST3" "SL000283:B2M"
# proteins found in all datasets
finalSet[grep("cdcs:scRNAseq:immaculate", names(finalSet))]
## cdcs:scRNAseq:immaculate3 cdcs:scRNAseq:immaculate1 cdcs:scRNAseq:immaculate6
## "SL009324:FSTL3" "SL007033:LTBP4" "SL001996:ANGPT2"
## cdcs:scRNAseq:immaculate4 cdcs:scRNAseq:immaculate5
## "SL007206:THBS2" "SL000052:TNNT2"